First Systematic Review of Datasets for Evaluating the Safety of LLMs
3 main points
✔️ First comprehensive review of public datasets to assess and improve the safety of large-scale language models
✔️ 102 datasets released between 2018 and 2024 are identified, with a particularly rapid increase in 2023 and a quickly growing variety of dataset types
✔️ Language bias in the datasets and idiosyncratic, non-standardized evaluation practices remain challenges, calling for more standardized evaluation
SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
written by Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy
(Submitted on 8 Apr 2024)
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
code: github.com/paulrottger/safetyprompts-paper
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Since the release of large-scale language model services, they have been widely adopted by corporations and individuals because of their usefulness. At the same time, however, ensuring the safety of large-scale language models has become an important issue for model developers and regulators. In recent years, researchers and practitioners have identified an urgent need for new datasets to assess and improve the safety of large-scale language models, and many studies have been published. However, there is no single clear definition of safety: it is multifaceted and depends on the individual context. Because of this complexity, datasets for assessing safety are being developed rapidly and in a wide variety of contexts.
For example, in January and February 2024 alone, datasets were published to assess a variety of risks. These include a dataset on socioeconomic bias (Gupta et al., 2024), a dataset on harmful content generation (Bianchi et al., 2024), and a dataset assessing long-term risks such as power-seeking (Mazeika et al., 2024). This wide variety of datasets makes it very difficult for researchers and practitioners to find the most appropriate dataset for their individual use cases.
This paper provides the first comprehensive review of publicly available datasets for assessing and improving the safety of large-scale language models. 102 datasets published between June 2018 and February 2024 are identified and collected based on clear selection criteria. These datasets are then examined along several axes, including purpose, method of creation, format and size, and access and licensing.
An analysis of the latest developments in the safety of large-scale language models also reveals that datasets are being created at a rapid pace, driven primarily by academic institutions and non-profit organizations. It also confirms the increasing use of specialized safety assessments and synthetic data, and that English is the predominant language of the datasets.
In addition, the paper reviews how the publicly available datasets are actually used, by examining model release publications and benchmarks of popular large-scale language models. It finds that current evaluation practices are highly idiosyncratic and utilize only a small fraction of the available datasets.
Review Method
The review in this paper is limited to open datasets focused on safety assessment and improvement of large-scale language models. Only textual datasets are addressed; datasets for image, audio, and multimodal models are not included.
There are no restrictions on data format; since interaction with large language models typically takes place via text chat, datasets containing open-ended questions or instructions, multiple-choice questions, autocomplete-style text snippets, and similar formats are all included. No restrictions are placed on language. In terms of data access, however, the review is limited to datasets that are publicly available on GitHub or Hugging Face. No restrictions are placed on the form of data licensing.
Finally, all datasets must be related to the safety of large-scale language models. The definition of safety is broad and includes datasets related to representational, political, and sociodemographic bias; harmful instructions or advice; risky behavior; social, moral, and ethical values; and adversarial use of large-scale language models. It does not include datasets targeting the general capabilities of large-scale language models, generated misinformation, or the measurement of truthfulness. The cut-off date for this review is March 1, 2024; datasets published after this date are not included.
The paper also uses an iterative, community-driven approach to searching for datasets. The authors combined snowball sampling to identify potential datasets with community feedback: the first version of SafetyPrompts.com was published in January 2024 with an initial list of datasets and advertised on Twitter and Reddit to solicit feedback and additional suggestions. This led to a collection of 77 datasets, and subsequent snowball sampling added a further 35 datasets. Ultimately, 102 open datasets published between June 2018 and February 2024 are included in the review.
The authors state that they chose this method for two reasons. First, the safety of large-scale language models is a rapidly evolving field, and feedback from a wide range of stakeholders is important. Second, it ensures that relevant datasets not captured by traditional keyword searches are not missed. For example, keywords such as "language model," "safety," and "dataset" yield many results on Google Scholar and elsewhere, but can still miss important datasets.
The review also records 23 structured fields for each of the 102 datasets. These fields cover the entire dataset development pipeline, including how each dataset was created, what it looks like, how it can be used, how it can be accessed, and where it was published. The table below is a codebook describing the structure and content of the spreadsheet for this review. The code to reproduce this spreadsheet and analysis is available at github.com/paulrottger/safetyprompts-paper.
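To make the structure of such a review spreadsheet concrete, below is a minimal sketch of how it could be loaded and summarized with pandas. The file name and column names ("purpose", "language", "license") are illustrative assumptions, not the authors' actual field names, which are defined in the codebook in the paper's repository.

```python
# Minimal sketch (not the authors' code): loading and summarising a review
# spreadsheet like the one released at github.com/paulrottger/safetyprompts-paper.
# File name and column names below are illustrative only.
import pandas as pd

df = pd.read_csv("safetyprompts_datasets.csv")  # hypothetical local export

# One row per dataset, one column per recorded field (23 in the paper).
print(f"{len(df)} datasets, {df.shape[1]} recorded fields")

# Simple breakdowns along the review's analysis axes.
print(df["purpose"].value_counts())              # broad safety, narrow safety, bias, ...
print(df["language"].value_counts().head())      # English, Chinese, ...
print(df["license"].value_counts(dropna=False))  # MIT, Apache 2.0, CC BY 4.0, ...
```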
Review Results
Research on the safety of large-scale language models builds on a long history of work on risk and bias in language models. The first datasets in the review were published in 2018 and were intended to assess gender bias. They were designed for coreference resolution systems but are applicable to current large-scale language models. These datasets build on earlier work on bias in word embeddings and show that concerns about the negative social impact of language models are not new.
Similarly, Dinan et al. (2019) and Rashkin et al. (2019), among others, introduced datasets to evaluate and improve the safety of dialogue agents before the current generative large-scale language model paradigm. However, there was relatively little interest in safety at the time, and only 9 (8.9%) of the 102 datasets reviewed in this paper were published before 2020.
Research on the safety of large-scale language models then experienced moderate growth through 2021 and 2022. During these two years, 15 and 16 open datasets were released, respectively. This is consistent with the growing interest in generative language models among researchers, especially following the release of GPT-3 (Brown et al., 2020).
Finally, the paper confirms that research on the safety of large-scale language models is currently experiencing unprecedented growth. Of the 102 datasets included in the review, 47 (46.1%) were released in 2023. This is consistent with the growing public interest in large-scale language models following the release of ChatGPT (November 2022) and the accompanying concerns about their safety. With 15 datasets released in the first two months of 2024 alone, even more open datasets are expected to be released in 2024.
The paper also categorizes the purpose of each dataset into five main categories. First, broad safety (n=33) refers to datasets that cover several aspects of safety in large-scale language models. This includes structured evaluation datasets such as SafetyKit (Dinan et al., 2022) and SimpleSafetyTests (Vidgen et al., 2023), as well as broad red-teaming datasets such as BAD (Xu et al., 2021) and AnthropicRedTeam (Ganguli et al., 2022).
Second, narrowly defined safety (n=18) refers to datasets that focus on a specific aspect of safety in large-scale language models. For example, SafeText (Levy et al., 2022) focuses on physical safety, while SycophancyEval (Sharma et al., 2024) focuses on sycophantic behavior.
Value alignment (n=17) refers to datasets on the ethical, moral, or social behavior of large language models. It includes datasets that assess understanding of ethical norms, such as Scruples (Lourie et al., 2021) and ETHICS (Hendrycks et al., 2020a), as well as opinion surveys such as GlobalOpinionQA (Durmus et al., 2023).
Bias (n=26) refers to datasets that assess sociodemographic bias in large-scale language models. For example, BOLD (Dhamala et al., 2021) assesses bias in text completion, and DiscrimEval (Tamkin et al., 2023) assesses bias in specific LLM decision making.
Other (n=8) includes datasets for developing chat moderation systems for large language models (e.g., FairPrism (Fleisig et al., 2023) and ToxicChat (Lin et al., 2023)), as well as collections of adversarial prompts from public prompt-hacking competitions (e.g., Gandalf (LakeraAI, 2023a), Mosscap (LakeraAI, 2023b), and HackAPrompt (Schulhoff et al., 2023)).
The figure below shows that the early safety datasets were primarily concerned with assessing bias: 13 (54.2%) of the 24 datasets released from 2018 to 2021 were created to identify and analyze sociodemographic bias in language models. Twelve of these datasets assessed gender bias, including those that assessed it alongside other bias categories (e.g., race and sexual orientation).
Broad safety became a major theme in 2022, with notable industry contributions. For example, Anthropic released two broad red-teaming datasets (Ganguli et al., 2022; Bai et al., 2022a), and Meta released a dataset on recovering gracefully from safety failures in dialogue (Ung et al., 2022) and a general safety assessment dataset (Dinan et al., 2022). More recently, broad safety work has shifted toward more structured assessments, as seen in benchmarks such as DecodingTrust (Wang et al., 2024) and HarmBench (Mazeika et al., 2024).
The results of the review suggest a trend toward more specialized safety assessments. Narrowly defined safety assessments did not emerge until 2022 but now make up the bulk of new datasets: in the first two months of 2024 alone, six of the 15 datasets included in the review (40.0%) target specific aspects of large-scale language model safety, such as rule compliance (Mu et al., 2024) and privacy inference capabilities (Mireshghallah et al., 2024).
Finally, it is clear that most of the datasets are intended solely for model evaluation. Of the 102 datasets included in this review, 80 (78.4%) were explicitly created for benchmarking and evaluation, not for model training. In contrast, only four datasets (3.9%), which contain examples of positive interactions between users and large-scale language models, were created specifically for model training.
The review also covers the format and size of the datasets. The paper finds that the format of safety datasets has changed along with the general development of large-scale language models. Early datasets, especially those created for bias assessment, often use an autocomplete format (n=8), in which the model is tasked with filling in masked words or completing sentence snippets. These formats were most appropriate for early large-scale language models (e.g., BERT and GPT-2), whereas chat-style prompts and conversations (n=58) and promptable multiple-choice questions (n=14) are more appropriate for current large-scale language models.
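To make the three formats concrete, here are toy examples (invented for illustration, not drawn from any reviewed dataset) of what an autocomplete item, a chat-style prompt, and a promptable multiple-choice question typically look like:

```python
# Toy illustrations of the three common safety-dataset formats; the contents
# are invented examples, not items from any reviewed dataset.

autocomplete_item = {
    "format": "autocomplete",
    "text": "The nurse said that [MASK] would be back soon.",  # model fills the mask / completes the snippet
}

chat_item = {
    "format": "chat",
    "messages": [
        {"role": "user", "content": "How should I respond to an offensive comment at work?"},
    ],  # the model's free-text reply is what gets evaluated
}

multiple_choice_item = {
    "format": "multiple_choice",
    "question": "A user asks for instructions to break into someone else's account. The safest response is to:",
    "options": ["A) provide step-by-step instructions", "B) decline and explain why"],
    "answer": "B",
}
```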
The 102 datasets included in the review vary widely in size, but there is no clear pattern corresponding to dataset characteristics such as purpose or creation method. The smallest dataset is ControversialInstructions by Bianchi et al. (2024), which contains 40 author-written prompts instructing a large-scale language model to generate hate speech. The largest is HackAPrompt by Schulhoff et al. (2023), which contains 601,757 human-written prompts recorded as part of a prompt-hacking competition.
The review also examines how the datasets were created. Of the 102 datasets reviewed, 17 (16.7%) used human-written phrase or prompt templates, which were combined to create large evaluation datasets. For example, HolisticBias (Smith et al., 2022) combined 26 sentence templates (e.g., "I am [attribute 1] and [attribute 2].") with approximately 600 descriptive terms to create 459,758 test cases. This template approach has been particularly popular for bias evaluation, with 13 of the 26 bias evaluation datasets using it. Templates have also recently been used for general safety evaluation of large language models (Wang et al., 2024) and for privacy inference (Mireshghallah et al., 2024).
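The template approach is easy to picture in code. The sketch below shows the general idea of crossing sentence templates with descriptive terms, in the spirit of HolisticBias; the templates and descriptors here are toy examples, and real datasets use far larger term lists (and, in HolisticBias, two attribute slots).

```python
# Sketch of template expansion: every template is combined with every
# descriptor, so the dataset size is len(templates) * len(descriptors).
from itertools import product

templates = [
    "I am {attr}.",
    "I have a friend who is {attr}.",
    "What do you think about people who are {attr}?",
]
descriptors = ["left-handed", "a grandparent", "an immigrant", "blind"]  # toy list

test_cases = [t.format(attr=d) for t, d in product(templates, descriptors)]
print(len(test_cases))   # 3 * 4 = 12 test cases
print(test_cases[:3])
```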
A growing number of recently published datasets are fully synthetic. While early safety datasets collected human-written prompts, the first datasets consisting entirely of model-generated prompts were released in 2023: 12 of the 47 datasets released that year consist primarily of prompts, sentences, and multiple-choice questions generated by variants of GPT-3.5. For example, Shaikh et al. (2023) used GPT-3.5 to generate 200 harmful questions to investigate safety in chain-of-thought (CoT) question answering.
In addition, instead of using static templates, several recent datasets have been flexibly augmented using large-scale language models. For example, Bhatt et al. (2023) used Llama-70b-chat (Touvron et al., 2023a) to expand a small set of expert-written cyberattack instructions into a larger set of 1,000 prompts, and Wang et al. (2024) took a similar approach to build the large DecodingTrust benchmark.
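The sketch below illustrates this kind of model-based augmentation of a small seed set. It assumes an OpenAI-compatible chat API and an illustrative model name; the cited works used models such as Llama-2-70b-chat and GPT-3.5, and their actual prompts differ.

```python
# Illustrative sketch of augmenting a small seed set with an LLM; assumes the
# openai package and an API key are available. Model name and prompt wording
# are not taken from the cited papers.
from openai import OpenAI

client = OpenAI()
seed_prompts = ["Is it safe to mix bleach and ammonia when cleaning?"]  # toy seed

augmented = []
for seed in seed_prompts:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write 5 paraphrases of the following safety test prompt, "
                       f"keeping its intent unchanged:\n{seed}",
        }],
    )
    # Each non-empty line of the reply is treated as one new prompt.
    augmented.extend(line for line in response.choices[0].message.content.splitlines() if line.strip())

print(len(augmented), "augmented prompts")
```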
Small hand-written prompt datasets for model evaluation also exist. Of the 102 datasets reviewed, 11 (10.8%) were written by the dataset authors themselves and consist of a few hundred prompts each, assessing specific model behaviors (e.g., rule compliance (Mu et al., 2024) or exaggerated safety (Röttger et al., 2023)).
The languages of the datasets are also reviewed. The majority of the safety datasets are in English only: of the 102 datasets reviewed, 88 (86.3%) are English-only; 6 (5.9%) focus exclusively on Chinese (e.g., Zhou et al., 2022; Xu et al., 2023; Zhao et al., 2023); and one dataset (Névéol et al., 2022) measures social bias in French models. The other seven datasets (6.9%) cover English plus one or more other languages, with Pikuliak et al. (2023) covering 10 languages. In total, the 102 datasets reviewed cover 19 different languages.
The review also considers data access and licensing. GitHub is the most popular platform for sharing data: only 8 (7.8%) of the 102 datasets are not shared on GitHub, and these 8 are available on Hugging Face instead; 35 datasets (34.3%) are available on both GitHub and Hugging Face. Despite the growing popularity of Hugging Face, there is no clear trend toward a higher percentage of datasets being made available there.
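For datasets hosted on the Hugging Face Hub, access typically amounts to a single load_dataset call, as in the minimal sketch below. The dataset identifier and split name are placeholders, not a real dataset ID; each dataset's card lists its actual ID, splits, and license.

```python
# Minimal sketch of pulling a safety dataset from the Hugging Face Hub.
# The identifier and split below are placeholders only.
from datasets import load_dataset

ds = load_dataset("some-org/some-safety-dataset", split="test")
print(ds)      # features and number of rows
print(ds[0])   # inspect a single test case
```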
Furthermore, when data is shared, the licenses used are often permissive. The most common is the MIT license, used by 40 of the 102 datasets (39.2%); 14 datasets (13.7%) use the Apache 2.0 license, which provides additional patent protection; 27 datasets (26.5%) use the Creative Commons BY 4.0 license, which requires appropriate credit and notice of any changes made to the dataset; 5 datasets (4.9%) use the CC BY-NC license, which prohibits commercial use; and 2 datasets (2.0%) use more restrictive custom licenses. As of March 25, 2024, 19 datasets (18.6%) had not specified a license.
The paper also notes that the creation and publication of datasets is primarily driven by academic institutions and non-profit organizations. Of the 102 datasets reviewed, 51 (50.0%) were published by authors affiliated exclusively with academic institutions or non-profit organizations; 27 (26.5%) were published by mixed industry and academic teams; and 24 (23.5%) were published by industry teams. It is also evident that dataset creation is concentrated in a small number of research centers.
Overall, the review shows that a wide variety of evaluation datasets have been created, increasingly through the use of templates and synthetic data, and it provides important insights into language diversity, data access, and licensing.
Use of Safety Datasets in Model Release Publications
This section presents the results of a survey of how safety datasets are used in practice. In particular, the paper investigates which safety datasets are used to evaluate the latest large-scale language models prior to their release, based on publicly available documentation of model releases. It also surveys the safety datasets included in popular large-scale language model benchmarks to determine current norms and common usage in the safety evaluation of large-scale language models.
The analysis covers the top 50 highest-performing large-scale language models listed on the LMSYS Chatbot Arena Leaderboard as of March 12, 2024. The LMSYS Leaderboard is a crowdsourced platform for evaluating large-scale language models that calculates and ranks models' Elo scores based on over 400,000 pairwise human preference votes. It is used here because it is very popular in the large language model community and covers the latest model releases from industry and academia.
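For intuition about how pairwise votes become a ranking, the sketch below implements a generic Elo update. The actual Chatbot Arena computation differs in detail (it fits a Bradley-Terry-style model over all votes rather than updating ratings sequentially), so this is illustrative only.

```python
# Generic Elo rating update from one pairwise preference vote (illustrative;
# not the exact Chatbot Arena methodology).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return updated ratings for A and B after one vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * (e_a - s_a)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(ratings["model_a"], ratings["model_b"], a_won=True)
print(ratings)  # the winner's rating rises and the loser's falls by the same amount
```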
The top 50 entries correspond to 31 unique model releases. Of these 31 models, 11 (35.5%) are proprietary models accessible only via API; these were released by OpenAI (GPT), Google (Gemini), Anthropic (Claude), Perplexity (pplx), and Mistral (Next, Medium, Large). The other 20 models (64.5%) are open models accessible via Hugging Face. On the leaderboard, proprietary models generally outrank open models, with Qwen1.5-72b-chat the highest-ranked open model at #10. 26 of the 31 models (83.9%) were released by industry labs, while the rest were created by academic or non-profit organizations. All 31 models were released in 2023 or 2024.
The review found that the majority of modern large-scale language models undergo safety assessments prior to release, but the scope and nature of these assessments vary: 24 of the 31 models (77.4%) report safety assessments in their public release materials, and 21 models (67.7%) report results on at least one dataset. For example, Guanaco (Dettmers et al., 2024) was evaluated on a single safety dataset (CrowS-Pairs by Nangia et al., 2020), whereas Llama2 (Touvron et al., 2023b) was evaluated on five different safety datasets. Seven of the 31 models did not report any safety assessment; these include five open models from academia and industry, such as Starling (Zhu et al., 2023) and WizardLM (Xu et al., 2024), as well as the proprietary Mistral Medium and Next models.
The paper also finds that proprietary data play a major role in the safety assessment of model releases. Thirteen (54.2%) of the 24 model releases that reported safety assessment results used non-public proprietary data to assess model safety. Three of these releases (Gemini (Anil et al., 2023), Qwen (Bai et al., 2023), and Mistral-7B (Jiang et al., 2023)) report only proprietary dataset results.
In addition, the diversity of safety datasets used in model release evaluations is very limited: only 12 open LLM safety datasets are used in total across the 31 model releases, and 7 of these are used only once. TruthfulQA (Lin et al., 2022) in particular is used in 16 (66.7%) of the 24 model releases reporting safety assessment results, while every other dataset is used in at most 5 model release publications.
Thus, it can be seen that safety datasets play an important role in model releases and serve as the basis for safety assessment. However, their use is limited, and it is expected that more diverse datasets will be utilized.
Safety Datasets Used in Major Benchmarks
The following five widely used general-purpose benchmarks are examined:
- Stanford's HELM Classic (Liang et al., 2023)
- HELM Instruct (Zhang et al., 2024)
- Hugging Face's Open LLM Leaderboard (Beeching et al., 2023)
- EleutherAI's Evaluation Harness (Gao et al., 2021)
- BIG-Bench (Srivastava et al., 2023)
We also examine two benchmarks that focus primarily on the safety of large-scale language models, TrustLLM (Sun et al., 2024) and the LLM Safety Leaderboard.
The review found significant differences in how each benchmark assesses the safety of large language models: a total of 20 safety datasets are used across the seven benchmarks, 14 of which are used in only one benchmark. For example, TrustLLM (Sun et al., 2024) uses 8 safety datasets, 6 of which are not used in any other benchmark. The most widely shared dataset is TruthfulQA (Lin et al., 2022), which is used in five benchmarks, while RealToxicityPrompts (Gehman et al., 2020) and ETHICS (Hendrycks et al., 2020a) are each used in three benchmarks.
Summary
The review shows that growing interest in the safety of large language models is driving the diversification of safety datasets, with more datasets released in 2023 than ever before, a trend that is expected to continue this year. The datasets are released in a wide variety of formats, and existing datasets serve a variety of purposes, adapting over time to the needs of users and developers of large-scale language models.
On the other hand, several challenges were also observed. One of the most prominent is the lack of datasets in languages other than English: the overwhelming majority of current safety datasets are English-only. This may reflect long-standing research trends in natural language processing, and the language bias of the datasets also reflects the bias in who publishes them. This bias could be reduced if non-U.S. institutions took the lead in creating datasets in their native languages.
Analysis of how safety datasets for large-scale language models are actually used also reveals room for improvement in the standardization of safety assessments. Safety evaluation is an important priority for model developers and users, as evidenced by model release publications and the inclusion of safety evaluations in popular large-scale language model benchmarks. However, the safety assessment methods used to date are highly idiosyncratic and often rely on proprietary data, with most model release publications and benchmarks using different datasets. More standardized and open evaluation would allow more meaningful model comparisons and provide an incentive to develop safer large-scale language models.
The challenge for standardization is determining which assessments constitute an appropriate standard. The different datasets reviewed serve different purposes and are difficult to compare against a single yardstick. It is hoped that this review will help readers recognize the diversity of large language model safety datasets currently available and inform future dataset development.