
LLMSanitize

LLMSanitize" A Comprehensive Survey Of Data Contamination Areas In Large-scale Language Models


3 main points
✔️ First comprehensive review of the field of contamination detection in large-scale language models
✔️ Categorizes contamination into data contamination and model contamination and organizes existing research in each category
✔️ Introduces LLMSanitize, an open source Python library for sharing contamination detection methods

How Much are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library
written by Mathieu Ravaut, Bosheng Ding, Fangkai Jiao, Hailin Chen, Xingxuan Li, Ruochen Zhao, Chengwei Qin, Caiming Xiong, Shafiq Joty
(Submitted on 31 Mar 2024)
Comments: 8 pages, 1 figure, 1 table
Subjects: Computation and Language (cs.CL)


code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

With the rapid evolution of artificial intelligence, large-scale language models have become an important tool for innovation in areas as diverse as natural language processing, automatic content generation, complex decision-making systems, and autonomous agents. These models are trained on huge data sets and can produce natural answers that are indistinguishable from those produced by humans. However, if the integrity of the dataset is compromised, the validity and reliability of the model can be seriously affected.

Contamination" is a major problem in this field. This contamination includes "data contamination," where evaluation results are skewed by the inclusion of evaluation data sets in the training set, and "model contamination," where the model sees the evaluation data sets beforehand. This can result in inaccurate model performance evaluations and possible bias. Further, depending on whether the contaminated data contains only inputs or both inputs and labels, it can be classified as "input contamination" or "input + label contamination. Model contamination detection methods can be divided into white box detection, which allows full access to the local model, and black box detection, which only allows API access.

Contamination poses challenges not only in terms of technical accuracy, but also in ethical and commercial terms. The risk of relying on contaminated data is especially great in areas where trust is required, such as medical diagnostics, legal advice, and financial services. As companies utilize AI for strategic decision making, assurance of data accuracy is critical. The reliability of the output of large-scale language models also affects investor confidence, which in turn relates to technological superiority and financial prospects.

This situation calls for a comprehensive survey and resource sharing of contamination detection in large-scale language models. This paper clarifies the scope and nature of contamination, identifying its sources, types, and impact on model performance. In doing so, it also emphasizes the importance of strategies to mitigate contamination risks and to ensure that large-scale language model deployments are equitable and economically sustainable.

The paper introduces methods and findings related to data contamination and model contamination, current and future challenges, best practices, and LLMSanitize, an open-source Python library for sharing contamination detection methods.

Data Contamination Review

The purpose of data contamination detection is to check whether D ∩ D_E is empty, given a training dataset D and an evaluation dataset D_E. This is very important to ensure that performance on evaluation benchmarks is not distorted by contaminated data.

Training reports for several large language models assess the level of contamination between downstream evaluation datasets and pre-training sets using the most basic data contamination detection technique, string matching. The following reports apply this technique (a minimal n-gram overlap sketch follows the list):

  • GPT-2 (Radford et al., 2019)
    • Contamination is calculated as the percentage of 8-grams from the evaluation set that also appear in the WebText training set. For common language modeling datasets, the overlap between their test sets and WebText ranges from 1% to 6%, with an average overlap of 3.2%.
  • GPT-3 (Brown et al., 2020)
    • Removed from the Common Crawl (C4) training data any data points with 13-gram overlaps with evaluation sets. The analysis revealed massive data contamination, showing that the Wikipedia language modeling benchmarks and SQuAD 2.0 were almost completely contaminated.
  • Dodge et al. (2021)
    • Examined the C4 dataset to measure how much of the training and test sets of NLP tasks appears in the pre-training corpus. Contamination levels ranged from less than 2% to more than 50%.
  • PaLM (Chowdhery et al., 2023)
    • Identified 10 evaluation datasets at risk of contamination and split each into clean and contaminated subsets depending on whether at least 70% of a data point's 8-grams appear in the training set. As with GPT-3, the performance gap between the clean and contaminated subsets is small.
  • GPT-4 (Achiam et al., 2023)
    • Measures the degree of contamination between the evaluation set and the pre-training data by considering a 50-character substring randomly extracted from an evaluation data point to be a duplicate if it appears as a substring of the training set. Contamination was found to have little impact on zero-shot results.
  • Llama-2 (Touvron et al., 2023)
    • Measures contamination at the token level: a token is considered contaminated if it appears in an n-gram of 10 or more tokens found in both the evaluation sample and the training set. The contamination level is the percentage of contaminated tokens.
  • Li (2023b)
    • Calculates the METEOR score between pages retrieved from CommonCrawl and queries issued through the Bing API, and considers anything above 0.75 to be contaminated. Contamination levels range from 1% to 47%.
  • Deng et al. (2023)
    • Retrieved the top 10 documents from the pre-training dataset, split them into 13-gram chunks, and computed the overlap with the chunks of the evaluation data points. As a result, TruthfulQA shows high overlap with the pre-training dataset.
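
To make the string-matching idea concrete, here is a minimal sketch (not taken from any of the reports above) that computes the fraction of an evaluation sample's n-grams found in a training corpus. The whitespace tokenization, the choice of n, and the example threshold are simplifying assumptions for illustration.

```python
# Minimal n-gram overlap check in the spirit of the string-matching methods
# above (8-grams for GPT-2, 13-grams for GPT-3). Whitespace tokenization and
# the n value are simplifying assumptions.

def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(eval_text, training_corpus, n=8):
    """Fraction of the evaluation sample's n-grams found in the training corpus."""
    eval_ngrams = ngrams(eval_text.lower().split(), n)
    if not eval_ngrams:
        return 0.0
    train_ngrams = set()
    for doc in training_corpus:
        train_ngrams |= ngrams(doc.lower().split(), n)
    return len(eval_ngrams & train_ngrams) / len(eval_ngrams)

# Example: an overlap above some threshold flags the sample as contaminated.
train_docs = ["the quick brown fox jumps over the lazy dog near the river bank today"]
sample = "quick brown fox jumps over the lazy dog near the river"
print(ngram_overlap(sample, train_docs, n=8))  # 1.0 -> likely contaminated
```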

Simple string matching is ineffective for paraphrased samples. Calculating cosine similarity between embeddings is a more robust alternative that is less sensitive to lexical changes. The following methods use this technique (a minimal cosine-similarity sketch follows the list):

  • Lee et al. (2023)
    • Prevent contamination in the Open-Platypus dataset by removing test questions that have more than 80% cosine similarity to the training items.
  • Phi-1 (Gunasekar et al., 2023)
    • Shows that embedding-based retrieval between code snippets is effective where n-gram duplicate detection fails.
  • Riddell et al. (2024)
    • Provides a hybrid approach combining string matching and embedding similarity. Shows that widely used code generation benchmarks are contaminated by pre-training sets.
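
The sketch below illustrates this family of methods with an embedding-based check. The use of the sentence-transformers library, the model name "all-MiniLM-L6-v2", and the 0.8 threshold are illustrative assumptions rather than the exact setup of any of the papers above.

```python
# Embedding-similarity contamination check, in the spirit of Lee et al. (2023).
# Encoder choice and threshold are assumptions for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical choice of encoder

train_items = ["What is the capital of France?", "Explain photosynthesis in one sentence."]
test_items = ["Which city is France's capital?", "Describe how rainbows form."]

train_emb = model.encode(train_items, convert_to_tensor=True, normalize_embeddings=True)
test_emb = model.encode(test_items, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between every test item and every training item.
sims = util.cos_sim(test_emb, train_emb)  # shape: (len(test_items), len(train_items))

for i, item in enumerate(test_items):
    max_sim = sims[i].max().item()
    flag = "contaminated" if max_sim > 0.8 else "clean"
    print(f"{flag:12s} (max sim {max_sim:.2f}): {item}")
```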

String matching struggles to detect synthetic training data generated by large-scale language models, and embedding-based semantic similarity search requires choosing an appropriate similarity threshold. To improve robustness, methods have therefore been proposed that use a large-scale language model itself to detect contamination. The following studies take this approach (a hedged sketch of an LLM-based check follows the list):

  • Yang et al. (2023b)
    • Rephrased samples make contamination checking difficult; these are detected using a combination of embedding similarity search and a large-scale language model (GPT-4).
  • Zhu et al. (2023)
    • Introduced the Clean-Eval benchmark to evaluate decontamination performance, showing that GPT-3.5 and Llama-2 performance degrades on Clean-Eval data, confirming contamination.
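
As a rough illustration of an LLM-based check (not the exact prompt or pipeline of the papers above), the sketch below asks a chat model to judge whether a test sample is a paraphrase of a training sample; the prompt wording, the "gpt-4" model name, and the use of the OpenAI client are assumptions.

```python
# LLM-based contamination check in the spirit of Yang et al. (2023b): after an
# embedding search shortlists near-duplicates, an LLM judges whether a test
# sample is a rephrase of a training sample. Prompt and client are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_rephrase(train_sample: str, test_sample: str, model: str = "gpt-4") -> bool:
    prompt = (
        "Do these two examples convey the same content, such that one could be a "
        "paraphrase of the other? Answer only 'yes' or 'no'.\n\n"
        f"Example A: {train_sample}\n\nExample B: {test_sample}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")

print(judge_rephrase("What is the boiling point of water at sea level?",
                     "At sea level, at what temperature does water boil?"))
```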

These studies provide a variety of approaches for data contamination detection and are an important step toward improving the reliability of assessment benchmarks.

Model Contamination Review

Model contamination detection considers a model M trained on a training set D_M and an evaluation dataset D_E, with the goal of checking whether D_M ∩ D_E is empty. Exposure to evaluation data during training can affect the generalization ability of the model and artificially improve its performance. Unlike in data contamination, the training set is model dependent and may be unknown, as with GPT-3.5 and GPT-4.

First, we review studies that assess model contamination through performance analysis: M is applied to evaluation datasets with different timestamps, including datasets that did not exist at pre-training time and therefore cannot be contaminated. Comparatively poor performance on these newer data points suggests that the older datasets were contaminated. (A small timestamp-comparison sketch follows the list.)

  • Roberts et al. (2023)
    • Comparative analysis of GPT-4 and GPT-3.5-Turbo on code generation tasks using datasets from before and after specific cutoff dates, finding a significant positive correlation between the likelihood of exposure to coding-problem platforms such as Codeforces and Project Euler and test-case pass rates.
  • Li & Flanigan (2023)
    • Presents four innovative methods for detecting task contamination in large-scale language models: training data inspection, task example extraction, membership inference, and chronological analysis. Together, these provide a comprehensive toolkit for increasing the reliability of large-scale language models.
  • Jiang et al. (2024b)
    • Conducted a comprehensive evaluation of data contamination in pre-training language models using various public data sets. Found a U-shaped relationship between the degree of data contamination and model performance, suggesting that existing n-gram-based contamination definitions are inadequate. Emphasized the need for more robust contamination detection methods.
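
The timestamp-comparison sketch below is a generic illustration of the idea behind Roberts et al. (2023); the record format, the cutoff date, and the accuracy gap interpretation are hypothetical.

```python
# Timestamp-based check: compare accuracy on problems released before vs. after
# the model's training cutoff. Record format and cutoff date are hypothetical.
from datetime import date

records = [
    # (release_date, model_solved)
    (date(2021, 5, 1), True),
    (date(2021, 8, 3), True),
    (date(2023, 2, 10), False),
    (date(2023, 6, 21), False),
]
cutoff = date(2021, 9, 1)  # assumed training-data cutoff

def accuracy(rows):
    return sum(ok for _, ok in rows) / len(rows) if rows else float("nan")

before = [r for r in records if r[0] < cutoff]
after = [r for r in records if r[0] >= cutoff]

# A large gap in favour of pre-cutoff problems is a contamination signal.
print(f"pre-cutoff accuracy:  {accuracy(before):.2f}")
print(f"post-cutoff accuracy: {accuracy(after):.2f}")
```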

An intuitive way to detect model contamination in large language models is to carefully analyze the output in a controlled prompt setting. The output for contaminated data may be very similar to, or exactly match, the training data, and model confidence may be very high.

Also, when analyzing model memory, researchers focus on the training data that the model has fully memorized and use this to identify contamination in downstream tasks.

  • Elangovan et al. (2021)
    • They pointed out that the random data shuffling common in NLP tasks does not account for overlap between the training and test sets; this overlap leads to data leakage and overestimated performance metrics. Adopted a bag-of-words approach to assessing textual similarity, providing a foundational method for identifying and mitigating leakage (a bag-of-words similarity sketch follows this list).
  • Lee et al. (2021)
    • Validates the significant benefits of dataset deduplication: deduplication reduces the emission of memorized training data by a factor of 10, mitigates train-test overlap, and prevents overestimation of model accuracy. It also reduces data size by up to 19%, improving training efficiency while maintaining or improving model perplexity. This streamlines the learning process without compromising data quality or model performance, contributing to environmentally and economically sustainable model development.
  • Magar & Schwartz (2022)
    • They found evidence of contamination across multiple benchmarks and noted that it affects the performance of models such as GPT-3. Their approach to quantifying the impact of contamination involves training models on a mixture of general and task-specific data, then measuring memorization and exploitation by comparing performance on seen and unseen instances. The methodology highlights the subtle relationship between memorization and exploitation, suggesting that large-scale models trained on specific data configurations may exhibit different sensitivities to contaminated data.
  • Gemma Team Google DeepMind (2024)
    • Emphasizing that even aligned models remain vulnerable to adversarial attacks that make them recite memorized training data, they analyzed discoverable memorization across 10,000 documents from the corpus of Gemma's pre-trained model, using a method similar to that of Anil et al. (2023). The analysis distinguishes between "exact memorization," in which the model-generated text exactly matches the source, and "approximate memorization," measured with a 10% edit distance threshold. The results reveal that the Gemma models have memorization rates similar to the PaLM and PaLM-2 models.
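
The sketch below illustrates the bag-of-words similarity idea from Elangovan et al. (2021) in a generic way; the scikit-learn vectorizer, the toy sentences, and the 0.9 threshold are illustrative assumptions.

```python
# Bag-of-words train/test overlap check in the spirit of Elangovan et al. (2021).
# Vectorizer choice and threshold are assumptions for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_texts = ["the movie was great and the acting superb",
               "terrible plot and wooden acting"]
test_texts = ["the acting was superb and the movie great",
              "an original story with lively dialogue"]

vectorizer = CountVectorizer().fit(train_texts + test_texts)
train_vecs = vectorizer.transform(train_texts)
test_vecs = vectorizer.transform(test_texts)

sims = cosine_similarity(test_vecs, train_vecs)  # shape: (n_test, n_train)
for text, row in zip(test_texts, sims):
    status = "possible leakage" if row.max() > 0.9 else "ok"
    print(f"{status:16s} (max sim {row.max():.2f}): {text}")
```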

Other researchers have developed elaborate prompting techniques to elicit data completion by large-scale language models, which may indicate contamination if the output is suspiciously close to the actual training data. Because large-scale language models are aligned through RLHF, standard prompts for data completion have been reported to sometimes fail (Ouyang et al., 2022b). A simple completion-based memorization probe is sketched after the list below.

  • Nasr et al. (2023)
    • Reveals the vulnerability of large-scale language models, including ChatGPT, to data extraction attacks, raising serious privacy issues. By prompting with inputs outside the training data, they show that ChatGPT reproduces parts of its training data, including sensitive information at the individual level. The study shows that even models trained with alignment techniques aimed at reducing the regurgitation of memorized data can leak significant amounts of sensitive information, and it calls for the development of more robust defenses. The impact of this research is far-reaching and underscores the importance of addressing the privacy vulnerabilities inherent in the deployment of large-scale language models.
  • Weller et al. (2023)
    • Highlighted the potential for steering large-scale language models toward more factual content generation through effective prompting strategies. Adding instructional phrases such as "According to," which prompt the model to quote from a specific corpus, showed improvements in grounding as measured by the QUIP score. The method demonstrates its versatility and effectiveness across domains and corpora, and its potential to generate more accurate and reliable responses using the stored knowledge of large language models.
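
The sketch below is a generic completion-based memorization probe: the model is given the first half of a sample and its continuation is compared with the true second half. The "gpt2" stand-in model, the split point, and the SequenceMatcher similarity measure are assumptions, not the setup of the studies above.

```python
# Completion-based memorization probe: feed the model the first half of a
# sample and check how closely its continuation matches the true second half.
from difflib import SequenceMatcher
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM under test could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

sample = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom, it was the age of foolishness")
words = sample.split()
prefix = " ".join(words[: len(words) // 2])
true_suffix = " ".join(words[len(words) // 2:])

inputs = tokenizer(prefix, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=False)
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)

# A near-exact reproduction of the true suffix suggests memorization.
similarity = SequenceMatcher(None, completion.strip(), true_suffix).ratio()
print(f"similarity to true continuation: {similarity:.2f}")
```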

In addition, likelihood-based methods take advantage of the fact that models tend to predict the next token with higher confidence if the relevant data was seen during training.

  • Min-K% Prob (Shi et al., 2024)
    • Assumes white-box access to the logits or next-token probability distribution of the large-scale language model. Given a test sample X, the method runs the large-scale language model over all tokens of X, tracking the k% of tokens with the lowest predicted probability. It then computes the average of these lowest probabilities and considers X contaminated if this average is too high (a sketch appears after this list).
  • Oren et al. (2023)
    • A distinctive approach that analyzes a large-scale language model's likelihood on the evaluation dataset D_E: inference is run on D_E and on shuffled versions of D_E. If the model's log-probability on the unshuffled dataset differs statistically from the shuffled versions, this indicates contamination. The method rests on the assumption that evaluation datasets tend to appear in their default order when included in the pre-training set.
  • Li (2023a)
    • Compares the perplexity of benchmark samples against memorized and clean baselines. Finds significant memorization in recent models on key reading comprehension and summarization benchmarks, but shows less evidence of contamination on multiple-choice benchmarks. This method provides the community with a tool for rigorous contamination analysis and allows for more accurate and reliable model evaluation.
  • Dong et al. (2024)
    • Proposes two new likelihood-based contamination detection methods, CDD (Contamination Detection via Output Distribution) and TED (Trustworthy Evaluation via Output Distribution). CDD detects data contamination by observing the peakedness of a large language model's output distribution and works in a black-box fashion, providing a 21.8% to 30.2% relative average improvement in accuracy, F1 score, and AUC over existing approaches. TED corrects the output distribution of large language models to reduce the impact of data contamination on evaluation metrics, significantly reducing the performance gains attributable to contamination across a variety of scenarios and contamination levels.
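
The sketch below implements the core of Min-K% Prob under white-box access, using a Hugging Face causal LM. The "gpt2" stand-in model, k = 20%, and the need for a calibrated decision threshold are illustrative assumptions.

```python
# Min-K% Prob (Shi et al., 2024), sketched with a Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the LLM under investigation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def min_k_prob(text: str, k: float = 0.2) -> float:
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits  # (1, seq_len, vocab)
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Average over the k% of tokens with the lowest predicted probability.
    n = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n, largest=False).values
    return lowest.mean().item()

score = min_k_prob("The quick brown fox jumps over the lazy dog.")
# A score above a calibrated threshold flags the sample as likely seen in training.
print(f"Min-K% Prob score (mean log-prob of lowest 20% of tokens): {score:.3f}")
```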

Large-scale language models also enable a new paradigm in model contamination detection: Golchin & Surdeanu (2023b) propose guided prompting. Unlike a standard completion prompt, a guided prompt also includes additional information such as the name of the dataset and its split. Contamination is evaluated either as the average performance difference between standard and guided prompting, or by whether GPT-4 with in-context learning finds an exact match or two near matches among the guided completions. The latter method shows very high accuracy (92%-100%) in identifying contamination. Furthermore, the study highlights the prevalence of contamination in datasets such as AG News (Zhang et al., 2015), WNLI (Levesque et al., 2012), and XSum (Narayan et al., 2018), underscoring the need to address data integrity in large-scale language model applications.
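
The sketch below compares standard and guided completions via ROUGE-L, in the spirit of this method; the exact prompt wording and the generate_fn stub (a wrapper around whatever LLM is under test) are assumptions.

```python
# Guided vs. standard prompting in the spirit of Golchin & Surdeanu (2023b):
# contamination is suspected when naming the dataset/split in the prompt makes
# completions measurably closer (higher ROUGE-L) to the reference continuation.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def contamination_gap(generate_fn, dataset_name, split, first_part, reference_rest):
    standard = f"Complete the following text: {first_part}"
    guided = (f"You are given the first part of an instance from the {split} split "
              f"of the {dataset_name} dataset. Complete it exactly as it appears "
              f"in the dataset: {first_part}")
    score_std = scorer.score(reference_rest, generate_fn(standard))["rougeL"].fmeasure
    score_gui = scorer.score(reference_rest, generate_fn(guided))["rougeL"].fmeasure
    return score_gui - score_std  # a consistently positive gap signals contamination

# generate_fn would wrap the LLM under test, e.g. an API call or model.generate(...).
```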

The authors also present a Data Contamination Quiz (DCQ) evaluation framework for detecting contamination of black-box large-scale language models. The large-scale language model under investigation is presented with five completion options: the verbatim text from the original dataset, three GPT-4 paraphrases, and a "none" option. If the model picks the verbatim answer, this is assumed to be due to memorization, and the authors show that this DCQ framework finds more contaminated cases than the guided prompting method.

New Evaluation Benchmarks

In order to provide contamination-free assessment, the following new datasets have been introduced:

  • LatestEval (Li et al., 2023b)
    • A benchmark that leverages up-to-date text to create dynamic reading assessments. There are three steps: collecting up-to-date text, extracting key information, and creating questions based on that information using template embedding and large-scale language models.
  • WIKIMIA (Shi et al., 2024)
    • A dynamic benchmark consisting of Wikipedia events created after 2023. A paraphrase setting that leverages ChatGPT to create paraphrased examples for evaluation has also been introduced, and a similar approach is used in Clean-Eval (Zhu et al., 2023).
  • KIEval (Yu et al., 2024)
    • An interactive evaluation framework incorporating "interactors" powered by large-scale language models. Multi-round dialogues with follow-up questions yield contamination-resistant evaluation. The evaluator determines whether the model's answers merely come from memorization or demonstrate a deeper ability to apply knowledge in more complex conversations.
  • LiveCodeBench (Jain et al., 2024)
    • An uncontaminated code generation benchmark built by continuously collecting new problems from LeetCode, AtCoder, and CodeForces. Some models (e.g., DeepSeek) performed noticeably worse on new LeetCode problems, suggesting signs of contamination.
  • Termite (Ranaldi et al., 2024)
    • A new text-to-SQL dataset whose public access via search engines is locked with a cryptographic key. It consists of manually created databases, each paired with about 5 queries, designed to match the properties of the Spider dataset (Yu et al., 2018); the Spider dataset itself, unfortunately, shows signs of heavy contamination in GPT-3.5.

In addition, Alzahrani et al. (2024) address the need for better evaluation benchmarks, noting that even small perturbations to existing benchmarks can affect model rankings on leaderboards.

Future Issues

Given the rapid and constant changes in machine learning research, future directions for contamination detection in large-scale language models may include a wide range of methodological and technical aspects. This paper identifies several important focal points for discussion.

The first is real-time contamination detection. Systems that continuously monitor data streams and alert users when a contamination event occurs are particularly important in areas such as finance (Fresard et al., 2011) and healthcare (Gao et al., 2022), where data integrity is critical and model reliability is essential. Given the vast amount of new data uploaded to the Internet every day, major technological breakthroughs are needed to solve this challenge.

The second is evasion of contamination detection. Dekoninck et al. (2024) show a very effective way to evade some existing contamination detection methods: Evasive Augmentation Learning (EAL), which paraphrases a benchmark with GPT-4 and fine-tunes a large-scale language model on the paraphrased data.

Third is an ethical and legal data framework. There is a need for a comprehensive ethical and legal framework (Chang, 2021) to govern the collection, use, and management of data used to train and study large-scale language models. This includes policies and protocols for data privacy, consent, and use that help prevent the incorporation of contaminated data from unscrupulous sources and the contamination of widely used pre-training data sources (e.g., CommonCrawl). It is important to develop contamination detection techniques without compromising individual privacy.

LLMSanitize Library

To facilitate progress in contamination detection for large-scale language models, the authors have built and released an open-source Python library, LLMSanitize, which supports both data contamination and model contamination use cases, as well as input and input+label contamination. The paper includes minimal pseudo-code for using the library; at a high level, it is structured as follows.

Users specify a training dataset and an evaluation dataset for data contamination, or a large-scale language model and an evaluation dataset for model contamination. Datasets and large-scale language models are expected to be available on the Hugging Face Hub, relying specifically on the transformers (Wolf et al., 2020) and datasets (Lhoest et al., 2021) libraries.

Efficient large-scale language model inference for model-based use cases is handled using the vLLM library (Kwon et al., 2023). Other key parameters to control include the number of evaluation data points to be processed and the hyperparameters specific to each contamination detection method.
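
For reference, a generic vLLM inference call of the kind such model-based methods rely on is sketched below; this is plain vLLM usage, not LLMSanitize's own API, and the model name is an assumption.

```python
# Generic vLLM batch inference (not the LLMSanitize API); model name assumed.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = ["Complete the following text: It was the best of times,"]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text)
```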

LLMSanitize has also been evaluated. The three recently popular model contamination methods described above, guided prompting (with ROUGE-L), sharded likelihood, and Min-K% Prob, are applied to four widely used 7B-sized large-scale language models (Llama-2, Qwen-1.5, Mistral, and Gemma). The chat version of each large-scale language model is applied to six datasets from the HuggingFace Open LLM Leaderboard, ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), TruthfulQA (Lin et al., 2021), Winogrande (Sakaguchi et al., 2021), and GSM8K (Cobbe et al., 2021), to measure contamination, subsampling 100 data points from each test set.

Finally, likelihood-based results on uncontaminated data, using March 2024 BBC news articles, are reported for comparison; the paper presents the results for guided prompting (with ROUGE-L), sharded likelihood, and Min-K% Prob, in that order.

In particular, the HellaSwag benchmark (Zellers et al., 2019) shows strong indications of contamination, regardless of the method and LLM. Thus, LLMSanitize is a powerful tool for contamination detection and is expected to be used in future research and practice.

Summary

Large-scale language models are rapidly evolving and the data used for training keeps growing, but their performance is easily biased by data contamination. This paper provides a detailed, systematically organized survey of the current state of the art in contamination detection for large-scale language models. It also introduces a new Python library, LLMSanitize, a general-purpose tool for rapid detection of contamination in a variety of models and datasets.

This detailed study and the proposed tools are expected to provide a foundation for future large-scale language models to address data and model contamination issues.

Takumu
I have worked as a Project Manager/Product Manager and Researcher at internet advertising companies (DSP, DMP, etc.) and machine learning startups. Currently, I am a Product Manager for new business at an IT company. I also plan services utilizing data and machine learning, and conduct seminars related to machine learning and mathematics.
