Data Contamination Detection Methodology Using Large-Scale Language Models
3 main points
✔️ Propose a new contamination detection method, the "LLM Decontaminator", which uses large-scale language models to overcome the limitations of existing detection methods
✔️ Propose new one-time tests (e.g., coding competitions) to evaluate large-scale language models, rather than relying on static benchmarks
✔️ Paraphrased test samples should be defined as contamination, as their inclusion in training data distorts benchmark results
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
written by Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica
(Submitted on 8 Nov 2023 (v1), last revised 11 Nov 2023 (this version, v2))
Comments: Published on arxiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
While large-scale language models are evolving rapidly, evaluating them is becoming increasingly difficult. Many benchmarks have been established in a short period to measure the performance of large-scale language models, but these scores do not necessarily reflect real-world performance. Furthermore, it has been pointed out that these benchmark datasets may be contaminated during preprocessing and fine-tuning.
For example, the Llama-2 contamination analysis (Touvron et al., 2023) found that more than 10% of the test samples for Massive Multitask Language Understanding (MMLU) were contaminated. Likewise, the GPT-4 technical report (OpenAI, 2023) found that 25% of HumanEval was contaminated with training data. Similar problems exist in open-source datasets: StarCoder Data (Li et al., 2023) was shown to contain hundreds of contaminated test cases.
Although the contamination problem is considered important, it remains difficult to detect accurately. Common methods include n-gram overlap and embedding similarity search. N-gram overlap is based on string matching and has been widely used for GPT-4 (OpenAI, 2023), PaLM (Anil et al., 2023), and Llama (Touvron et al., 2023), but its accuracy is limited. Embedding similarity search, on the other hand, uses the embeddings of a pre-trained model to find similar, potentially contaminated samples, but it is difficult to strike a balance between recall and precision. The increasing use of synthetic data generated by large-scale language models makes contamination even harder to detect; the Phi-1 report (Gunasekar et al., 2023) notes that some synthetic data similar to HumanEval test samples cannot be detected by n-gram overlap.
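For reference, a minimal sketch of the kind of n-gram overlap check described above might look like the following; the 10-gram window and whitespace tokenization are illustrative assumptions, not the exact setup used by GPT-4, PaLM, or Llama.

```python
# Minimal sketch of n-gram overlap contamination detection.
# The 10-gram window and whitespace tokenization are illustrative assumptions.

def ngrams(text: str, n: int = 10) -> set:
    """Return the set of word-level n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def has_ngram_overlap(test_sample: str, train_sample: str, n: int = 10) -> bool:
    """Flag contamination if any n-gram of the test sample also appears in the training sample."""
    return bool(ngrams(test_sample, n) & ngrams(train_sample, n))
```

A rephrased sample that swaps synonyms or reorders clauses breaks every shared 10-gram, so this check returns False even though the meaning is unchanged, which is exactly the failure mode the paper exploits.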
This paper introduces the concept of "rephrased samples" to study decontamination methods. These are samples that have the same meaning as the original test samples but are difficult to catch with existing contamination tests. They are generated by paraphrasing test samples or translating them into other languages using large-scale language models. When such rephrased samples are included in training, the model easily overfits them and achieves very high performance on the corresponding test benchmarks.
A 13B Llama model fine-tuned on such rephrased samples rivals the performance of GPT-4 on benchmarks such as MMLU, GSM-8k, and HumanEval, yet this contamination is not detected by n-gram overlap.
The paper also provides a detailed analysis of why existing decontamination methods fail and proposes a new decontamination method based on large-scale language models. The method first uses embedding similarity search to obtain the top-k training samples most similar to a given test sample, and then uses a powerful large-scale language model such as GPT-4 to check whether any of these samples are essentially the same as the test case. Results show that this method is significantly superior to existing approaches. In addition, applying the proposed method to widely used pre-training and fine-tuning datasets reveals previously unknown test overlap with public benchmarks.
In the RedPajama-Data-1T and StarCoder-Data pre-training sets, 8-18% of the HumanEval benchmark was identified as duplicated. It was also found that CodeAlpaca (Chaudhary, 2023), a synthetic dataset generated by GPT-3.5, contains rephrased samples covering 12.8% of HumanEval. This indicates the risk of contamination when training with synthetic data generated by large language models.
This paper calls for stronger decontamination methods for the public benchmarks used to evaluate large language models. Since current evaluation results may not reflect the true performance of models, more reliable decontamination methods need to be introduced.
It also recommends one-time exams in the form of competitions, such as Codeforces or Kaggle, to evaluate large-scale language models accurately. This is expected to provide a better measure of a model's actual capabilities and to reduce the risk of contamination.
The Concept of Rephrased Samples
In evaluating large language models, it is important to investigate how variations of the test set included in the training set affect the final benchmark performance. These test-case variations are referred to as "rephrased samples". The experiments cover a variety of benchmark domains, including math, knowledge, and coding. The example below is a rephrased GSM-8k sample; the 10-gram overlap is undetectable, but the meaning remains the same.
Since benchmark contamination can manifest itself in different ways, the rephrasing techniques differ as well. For text-based benchmarks, test cases are rephrased without changing their meaning, for example by reordering words or replacing them with synonyms. For code-based benchmarks, the rephrasing preserves semantics while changing coding style, naming conventions, and implementation details.
The rephrasing process employs a simple algorithm, shown in the figure below. The method uses a large language model (e.g., GPT-4) to generate a rephrased version of each test prompt and ensures that the result is not caught by detection methods such as n-gram overlap. A non-zero initial temperature is used to produce varied outputs, and the process is applied to every prompt in the test set to build a rephrased test set. Here, "RephraseLLM" refers to a high-performance large-scale language model (such as GPT-4 or Claude), while "isContaminated" refers to a contamination detection method such as n-gram overlap or embedding similarity search.
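A minimal sketch of this loop is given below, assuming the caller supplies a rephrase_llm function (a strong model such as GPT-4 sampled at non-zero temperature) and an is_contaminated detector (e.g., the n-gram check above); both names are illustrative stand-ins, not APIs from the paper.

```python
# Sketch of the rephrasing procedure described above.
# rephrase_llm stands in for RephraseLLM (a strong LLM sampled at non-zero temperature);
# is_contaminated stands in for isContaminated (e.g., n-gram overlap detection).
from typing import Callable

def build_rephrased_test_set(
    test_prompts: list,
    rephrase_llm: Callable[[str], str],           # returns a paraphrase with the same meaning
    is_contaminated: Callable[[str, str], bool],  # True if still detectable against the original
    max_tries: int = 10,
) -> list:
    rephrased = []
    for prompt in test_prompts:
        candidate = rephrase_llm(prompt)
        tries = 1
        # Keep sampling until the paraphrase evades the contamination detector.
        while is_contaminated(prompt, candidate) and tries < max_tries:
            candidate = rephrase_llm(prompt)
            tries += 1
        rephrased.append(candidate)
    return rephrased
```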
In addition to reordering words, a variety of other rephrasing techniques exist in real datasets, including translation. Using these techniques hides the rephrased samples and significantly inflates model scores.
Prompts with the same meaning but in different languages produce different embeddings in most language models. Translating test prompts into other languages therefore evades both n-gram overlap detection and embedding similarity search; only embedding models specifically trained on multiple languages can detect translated samples.
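For intuition, the sketch below compares the embedding of an original prompt with that of its translation using a monolingual sentence-embedding model; the model name and the example sentence (taken from GSM-8k) are illustrative choices, not the setup used in the paper.

```python
# Sketch: a translated prompt usually has low embedding similarity to the original
# under a model that was not trained for cross-lingual retrieval.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # English-oriented model (illustrative)

original = "Natalia sold clips to 48 of her friends in April."
translated = "Natalia a vendu des barrettes à 48 de ses amies en avril."  # French translation

embeddings = model.encode([original, translated], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {score:.2f}")  # often too low to be flagged as contamination
```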
For text-based data, translation can significantly improve scores while avoiding both n-gram overlap and embedding similarity search. This exploits the model's multilingual translation capability and effectively turns a knowledge assessment into a translation task. Translation is also useful in code benchmarks: a program that solves the same problem can be translated from Python into C or Java to observe the effect. To further investigate the impact of translation on code benchmarks, the paper proposes a multilingual data extension.
Code benchmarks can strengthen the translation technique with this multilingual data extension. Incorporating multiple languages improves the generalizability of the model and helps it recognize that the translated code and the original code perform the same function. In this way, understanding the concept of rephrased samples and the techniques behind them can help develop more accurate and effective methods for evaluating large-scale language models.
Contamination Detection Method Using Large-Scale Language Models
This paper introduces a new contamination detection method, the LLM Decontaminator, which accurately removes rephrased samples from a dataset with respect to a benchmark. The algorithm is proposed to overcome the limitations of existing detection methods such as n-gram overlap and embedding similarity search.
The algorithm consists of two steps: the first step identifies, for each test case, the top-k most similar training items using embedding similarity search; the second step then uses a high-performance large-scale language model such as GPT-4 to evaluate whether each pair is essentially identical.
This method makes it possible to determine how many rephrased samples a dataset contains at a modest computational cost. The "template" is a structured prompt that combines a test case and a training case and instructs the LLMDetector to compare them and return "True" or "False"; "True" indicates that the training case may be a rephrased version of the test case. The LLMDetector is a high-performance large-scale language model such as GPT-4, and TopKSimilarity uses embedding similarity search to identify the top-k most similar samples in the training data.
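The following is a minimal sketch of the two-step procedure, assuming a sentence-embedding model for TopKSimilarity and the OpenAI chat API as the LLMDetector; the embedding model, the judge prompt, and the helper names are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of the two-step LLM Decontaminator: embedding-based top-k search,
# followed by an LLM judgment of each (test case, training candidate) pair.
# The embedding model and the judge prompt below are illustrative assumptions.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()
embedder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

TEMPLATE = (
    "You are a contamination detector. Decide whether the two cases below are "
    "essentially the same problem (one may be a paraphrase or translation of the other).\n\n"
    "Test case:\n{test}\n\nTraining case:\n{train}\n\nAnswer True or False."
)

def top_k_similar(test_case: str, train_cases: list, k: int = 1) -> list:
    """Step 1: embedding similarity search for the k training items most similar to the test case."""
    test_emb = embedder.encode(test_case, convert_to_tensor=True)
    train_emb = embedder.encode(train_cases, convert_to_tensor=True)
    scores = util.cos_sim(test_emb, train_emb)[0]
    top_idx = scores.topk(min(k, len(train_cases))).indices.tolist()
    return [train_cases[i] for i in top_idx]

def is_rephrased_duplicate(test_case: str, train_case: str) -> bool:
    """Step 2: ask a strong LLM (here GPT-4 via the chat API) whether the pair is essentially identical."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": TEMPLATE.format(test=test_case, train=train_case)}],
    )
    return response.choices[0].message.content.strip().lower().startswith("true")

def contamination_rate(test_set: list, train_set: list, k: int = 1) -> float:
    """Fraction of test cases for which at least one top-k candidate is judged a rephrased duplicate."""
    flagged = sum(
        any(is_rephrased_duplicate(t, c) for c in top_k_similar(t, train_set, k))
        for t in test_set
    )
    return flagged / len(test_set)
```

Restricting the expensive LLM judgment to the top-k candidates returned by the embedding search is what keeps the overall cost modest.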
The figure below shows Venn diagrams for different contamination detection methods. Each diagram shows a subset of the training data and the coverage of a detection method. The solid circles represent the training data and its subsets, while the dotted circles enclose the portions of the dataset that each detection method flags as potentially contaminated.
The LLM Decontaminator leverages embedding similarity search to quickly filter candidates and then relies on the judgment of a strong large language model. N-gram overlap detection suffers from high false-negative rates when detecting rephrased samples, and embedding similarity search can return many false positives at a high threshold. The LLM Decontaminator, in contrast, detects rephrased samples with much higher accuracy.
Experiment - Impact of Rephrased Samples on Benchmarks
Here it is shown that models trained on rephrased samples achieve very high scores, matching the performance of GPT-4 on three widely used benchmarks: MMLU, HumanEval, and GSM-8k. This suggests that rephrased samples are contaminated data that should be removed from training sets. The authors also evaluate different contamination detection methods and apply the LLM Decontaminator to widely used training sets to find new contamination.
The first benchmark, MMLU (Hendrycks et al., 2020), covers a very broad range of subjects, spanning 57 areas from abstract algebra to professional psychology. Rephrasing MMLU requires considering many scenarios; given its complexity and multiple-choice format, the details of the rephrasing need to be worked out carefully.
In addition, n-gram overlap detection on multiple-choice questions tends to produce false positives when different questions share similar answer choices. The figure below shows an example of such a false positive: the questions are actually different even though the pattern of choices is exactly the same. To reduce this false-positive problem, the MMLU experiment introduces a "question-only" control group. "Question-only" means that only the body of the question is rephrased, while "full prompt" means that both the question body and the choices are rephrased.
Large numbers also tend to cause character-level overlaps. To avoid this, the format of large numbers is varied, for example by alternating between commas and spaces as separators. Terminology from various disciplines can cause similar overlaps; to avoid this, abbreviations and full terms are alternated and capitalization is adjusted, especially for options containing names and chemical formulas.
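As a small illustration of the number-formatting trick (the concrete formats are assumptions; the point is only that the surface form changes while the value does not):

```python
n = 29384750
print(f"{n:,}")                    # "29,384,750" — comma-separated form
print(f"{n:,}".replace(",", " "))  # "29 384 750" — space-separated alternative
```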
Llama-2-7B and Llama-2-13B are trained on the rephrased test set for 16 epochs. As shown in the table below, Llama-2 7B and 13B trained on rephrased samples achieve very high scores on MMLU, ranging from 45.3 to 88.5. This suggests that rephrased samples can significantly distort benchmark numbers and should be treated as contaminated data. The original models are tested with 5 shots, while the models trained on rephrased data are tested zero-shot.
The second benchmark, "HumanEval" (Chen et al., 2021), is a benchmark provided by OpenAI to assess the coding ability of large language models. In this benchmark, the model is provided with an incomplete code fragment and asked to complete it.
The HumanEval test set is paraphrased in Python and then translated into five programming languages: C, JavaScript, Rust, Go, and Java. These versions are used to train CodeLlama 7B and 13B. A multi-programming-language dataset containing the five languages is also built and used for training. The table below shows CodeLlama's performance when trained on the paraphrased Python, paraphrased C, and multi-language datasets.
CodeLlama 7B and 13B trained on the paraphrased samples achieve high scores on HumanEval, whereas GPT-4 achieves only 67.0.
The third benchmark, the "GSM-8K" (Cobbe et al., 2021), is a representative benchmark used to assess the mathematical capabilities of large-scale language models.
The table below shows that Llama-2 7B and 13B trained on the paraphrased samples achieve higher scores on GSM-8k. The original models are tested with 5 shots, while the models trained on the paraphrased data are tested zero-shot.
Experiment - Evaluation of Contamination Detection Methods
The first evaluation uses a decontamination benchmark built from three MMLU subjects: abstract algebra, sociology, and U.S. history. To compare the accuracy of detection methods on paraphrased samples, 200 prompt pairs are constructed using both the original and the paraphrased test sets: 100 random pairs and 100 paraphrased pairs. The F1 score on these pairs measures contamination-detection ability, with higher scores representing more accurate detection.
Random detection (Random) is used as a baseline, and a score significantly above random detection indicates the effectiveness of the detection method.
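A sketch of how such an evaluation could be scored is shown below, assuming each detector exposes a simple True/False verdict per pair and treating the 100 paraphrased pairs as the positive class; the detector interface and the use of scikit-learn are assumptions for illustration.

```python
# Sketch: scoring a contamination detector on the 200 constructed prompt pairs.
# Paraphrased pairs are positives (label 1), random pairs are negatives (label 0).
from sklearn.metrics import f1_score

def evaluate_detector(detector, paraphrased_pairs, random_pairs) -> float:
    """Return the F1 score of a detector over paraphrased (positive) and random (negative) pairs."""
    pairs = paraphrased_pairs + random_pairs
    labels = [1] * len(paraphrased_pairs) + [0] * len(random_pairs)
    predictions = [int(detector(test_case, train_case)) for test_case, train_case in pairs]
    return f1_score(labels, predictions)
```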
As shown in the table below, all detection methods except the LLM Decontaminator have clear failure modes. Paraphrased and translated samples go undetected by n-gram overlap; with multi-qa BERT, embedding similarity search is completely ineffective on translated samples; and multilingual BERT scores low on the U.S. history subject. These results demonstrate the reliability and accuracy of the LLM Decontaminator.
The second evaluation uses HumanEval. It shows that existing detection methods fail to detect HumanEval paraphrased samples, while the LLM Decontaminator succeeds in detecting them. 200 prompt pairs are constructed for HumanEval following the MMLU procedure described above, and the F1 score is evaluated for n-gram overlap, embedding similarity search, and the LLM Decontaminator. The table below shows that embedding similarity search is effective for detection within the same programming language but much less effective after translation. Of the methods investigated, only the LLM Decontaminator reliably detects paraphrased samples.
To further demonstrate the effectiveness of the LLM Decontaminator, the authors apply it to widely used real-world datasets and identify a large number of paraphrased samples. The table below displays the contamination rates of different benchmarks for each training dataset.
CodeAlpaca (Chaudhary, 2023) is a synthetic dataset generated with the Self-Instruct technique (Wang et al., 2023b) using OpenAI's Davinci-003. CodeAlpaca-20K is used to train many well-known models, including Tulu (Wang et al., 2023a). Using GPT-4 for detection with k=1, 21 paraphrased samples from the HumanEval test set are found, accounting for 12.8%. The figure below shows paraphrased HumanEval samples found within CodeAlpaca.
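As a quick sanity check on that figure (the HumanEval test set contains 164 problems):

```python
print(f"{21 / 164:.1%}")  # 12.8% — 21 flagged samples out of 164 HumanEval problems
```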
RedPajama-Data-1T (Computer, 2023) is a widely used dataset for training open-source models; both MPT (Team, 2023) and OpenLlama (Geng & Liu, 2023) use it for pre-training. The paper samples 16 GB of data from its GitHub subset and runs the LLM Decontaminator on it, identifying a total of 14 HumanEval paraphrased samples. The figure below shows paraphrased HumanEval samples found within RedPajama.
MATH (Hendrycks et al., 2021) is a widely recognized math training dataset spanning a variety of mathematical disciplines, including algebra, geometry, and number theory, and it has contributed to many math-centric datasets such as MathInstruct (Yue et al., 2023). The LLM Decontaminator reveals 79 paraphrased samples, amounting to 1.58% of the MATH test set. The example below shows a paraphrased MATH test sample found in the MATH training data.
FLAN (Longpre et al., 2023) is a comprehensive knowledge-oriented dataset that combines a variety of data sources. The CoT subset, which represents 1.63% of FLAN, is examined using GPT-4 for detection with the decontamination parameter set to k=1. The findings indicate that 76 test cases, or 0.543% of the MMLU test set, are paraphrased in it.
Summary
This paper examines the problem of benchmark contamination in large language models and evaluates existing decontamination methods. It shows that existing detection methods cannot detect test cases with simple variations. If such variations of test data are not removed, a 13B model can easily overfit the test benchmarks and achieve very high performance.
To address this, we propose a new detection method, the LLM Decontaminator. The method is applied to a real data set and reveals previously unknown test overlaps. The paper strongly encourages the research community to adopt stronger decontamination measures when using public benchmarks.
Paraphrased test samples should be considered contamination because their inclusion in the training data can distort the benchmark. However, the precise definition of contamination remains a difficult question. For example, in GSM-8k, a training sample and a test sample may differ only in the numbers used. A model trained under these conditions can memorize solutions, but it will struggle to generalize to unseen patterns. As a result, benchmark numbers may not accurately reflect a model's ability to solve mathematical problems.
As models are increasingly trained on data generated by large-scale language models, the potential for unintended contamination increases. For example, some contamination was found in the CodeAlpaca dataset generated by GPT. The authors caution that attention should be paid to potential contamination when training on synthetic data, and they recommend that model developers adopt stronger decontamination measures.
Furthermore, while the proposed decontamination method is a useful tool, how to detect contamination without access to the training data remains an open problem. Rather than relying on static benchmarks, the authors suggest creating fresh one-time problems to evaluate large language models; in the coding domain, for example, weekly coding competitions such as Codeforces could be used. They argue that benchmarks should be updated as quickly as models are developed.