 
  Proposal For MSciNLI, A Diverse Benchmark For Scientific Natural Language Reasoning
3 main points
✔️ Proposes "MSciNLI," a diverse dataset for scientific natural language inference tasks
✔️ Establishes baselines with pre-trained language models and large language models
✔️ Comprehensively analyzes model performance under domain shift
MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference
written by Mobashir Sadat, Cornelia Caragea
(Submitted on 11 Apr 2024)
Comments: Accepted to the NAACL 2024 Main Conference
Subjects: Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
Natural Language Inference (NLI) is the task of recognizing the semantic relationship between two sentences. The first sentence is referred to as the "premise" and the second as the "hypothesis." Traditional natural language inference datasets include SNLI, MNLI, SICK, and ANLI, which classify each hypothesis into one of three classes: entailed by the premise, contradicting it, or neutral. These datasets are used not only as benchmarks for natural language understanding (NLU), but also for downstream tasks such as fact checking and fake news detection. They also contribute to advances in representation learning, transfer learning, and multi-task learning.
However, since the samples in these datasets are mainly derived from the general domain, they do not adequately capture the linguistic properties of the scientific domain. For this reason, the task of scientific natural language inference was introduced together with its first dataset, SciNLI. SciNLI contains sentence pairs extracted from scientific papers related to computational linguistics, and, to support inference specific to scientific papers, the three classes of traditional natural language inference were replaced by four classes (ENTAILMENT, REASONING, CONTRASTING, and NEUTRAL). While SciNLI has received a great deal of attention in the research community, it is restricted to a single domain (ACL), so it lacks the diversity needed for a general scientific natural language inference benchmark.
Therefore, this paper proposes MSciNLI, a scientific natural language inference dataset containing sentence pairs extracted from papers published in five different domains: "Hardware," "Networks," "Software & Engineering," "Security & Privacy," and "NeurIPS." We build a large training set using the phrases that link sentences together in scientific papers and use the resulting, potentially noisy sentence pairs directly for training. The test and development sets consist of manually annotated sentence pairs, which yields high-quality evaluation data.
To assess the difficulty of MSciNLI, we use a BiLSTM-based model. We also fine-tune four pre-trained language models (BERT, SciBERT, RoBERTa, and XLNet) and prompt two large language models, LLAMA-2 and MISTRAL, in zero-shot and four-shot settings to establish baselines for MSciNLI. In addition, we evaluate performance under domain shift at test time and provide a comprehensive analysis of the generalization performance of scientific natural language inference models.
Building MSciNLI, a Diverse Scientific Natural Language Inference Benchmark
Here we present the data sources, construction method, and statistics for MSciNLI. The dataset is collected from papers published in four categories of the ACM Digital Library (Hardware, Networking, Software and Its Engineering, and Security and Privacy) and from papers published in NeurIPS. The table below provides a sample of sentence pairs extracted from these five domains.

The paper also adopts the data extraction and automatic labeling procedure based on distant supervision proposed by Sadat and Caragea in 2022. Sentence-linking phrases (e.g., "therefore," "accordingly," "in contrast") are used to automatically annotate a large (potentially noisy) training set with natural language inference relations. The list of linking phrases and their mappings to natural language inference relations is also shown in the table below.

The procedure first extracts pairs of adjacent sentences from the papers collected from the five domains such that, for the ENTAILMENT, REASONING, and CONTRASTING classes, the second sentence begins with a linking phrase. Each extracted sentence pair is assigned a class according to the linking phrase that begins its second sentence. For example, if the second sentence begins with "therefore" or "as a result of", the pair is labeled REASONING. After the label is assigned, the linking phrase is removed from the second sentence so that the model cannot predict the label simply by learning the superficial correlation between linking phrases and labels.
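The labeling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `label_pair` is hypothetical, and only the "therefore" and "as a result of" entries of the mapping come from the text; the other two entries are assumed for illustration.

```python
# Hypothetical phrase-to-label mapping. Only the two REASONING phrases are
# stated in the article; the CONTRASTING and ENTAILMENT entries are assumed.
LINK_PHRASES = {
    "therefore": "REASONING",
    "as a result of": "REASONING",
    "in contrast": "CONTRASTING",    # assumed mapping
    "in other words": "ENTAILMENT",  # assumed mapping
}


def label_pair(first, second):
    """Return (premise, hypothesis, label), or None if `second` does not
    begin with a known linking phrase."""
    lowered = second.lower()
    # Try longer phrases first so a short phrase cannot shadow a longer one.
    for phrase in sorted(LINK_PHRASES, key=len, reverse=True):
        if not lowered.startswith(phrase):
            continue
        rest = second[len(phrase):]
        # Require a word boundary right after the phrase.
        if rest[:1].isalnum():
            continue
        # Remove the linking phrase and the punctuation that follows it.
        rest = rest.lstrip(" ,")
        if not rest:
            continue
        return first, rest[0].upper() + rest[1:], LINK_PHRASES[phrase]
    return None


example = label_pair("The cache is small.", "Therefore, misses are frequent.")
```

Stripping the phrase before returning the hypothesis mirrors the point made above: the label should not be recoverable from the linking phrase itself.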
For the NEUTRAL class, sentence pairs are constructed by taking both sentences of a pair from the same paper in one of three ways. The first pairs two random sentences that do not begin with a linking phrase. The second selects a random sentence that does not begin with a linking phrase as the first sentence and pairs it with the second sentence of a random pair belonging to one of the other three classes. The third chooses a random sentence that does not begin with a linking phrase as the second sentence and pairs it with the first sentence of a random pair belonging to one of the other three classes.
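The three NEUTRAL construction strategies can be sketched as follows. The function name, argument names, and data layout are assumptions made for illustration, and the sketch does not filter out the rare case where a randomly chosen sentence happens to be adjacent to its partner.

```python
import random


def make_neutral_pairs(free_sentences, labeled_pairs, rng):
    """Build one NEUTRAL pair per strategy for a single paper.

    free_sentences: sentences from the paper that do not begin with a
        linking phrase (hypothetical input format).
    labeled_pairs: (premise, hypothesis) pairs from the same paper that
        belong to one of the other three classes.
    """
    pairs = []
    # Strategy 1: two distinct random sentences without a linking phrase.
    s1, s2 = rng.sample(free_sentences, 2)
    pairs.append((s1, s2))
    # Strategy 2: random first sentence + second sentence of a labeled pair.
    _, hypothesis = rng.choice(labeled_pairs)
    pairs.append((rng.choice(free_sentences), hypothesis))
    # Strategy 3: first sentence of a labeled pair + random second sentence.
    premise, _ = rng.choice(labeled_pairs)
    pairs.append((premise, rng.choice(free_sentences)))
    return [(p, h, "NEUTRAL") for p, h in pairs]


neutral = make_neutral_pairs(["a", "b", "c"], [("p", "h")], random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is a design choice of this sketch rather than something stated in the article.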
After extracting sentence pairs for all four classes, we randomly partition them at the paper level into a training set, a test set, and a development set, so that all sentence pairs extracted from a given paper fall into a single set. The automatically annotated samples are used directly to train the models. However, because the training set is built with distant supervision, label noise can occur when the linking phrase does not accurately capture the relationship between the two sentences. Therefore, to ensure a realistic evaluation, sentence pairs in the test and development sets were manually annotated by human annotators with one of the four scientific natural language inference relations.
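A paper-level split of this kind can be sketched as below. The `paper_id` field and the split fractions are assumptions for illustration; the article does not state the proportions used.

```python
import random
from collections import defaultdict


def split_by_paper(samples, rng, dev_frac=0.1, test_frac=0.1):
    """Split samples so that every pair from one paper lands in one set.

    `samples` is a list of dicts with a hypothetical "paper_id" key;
    the fractions are placeholder values, not the ones used in the paper.
    """
    by_paper = defaultdict(list)
    for sample in samples:
        by_paper[sample["paper_id"]].append(sample)

    paper_ids = sorted(by_paper)
    rng.shuffle(paper_ids)  # randomize at the paper level, not the pair level
    n_test = int(len(paper_ids) * test_frac)
    n_dev = int(len(paper_ids) * dev_frac)
    groups = {
        "test": paper_ids[:n_test],
        "dev": paper_ids[n_test:n_test + n_dev],
        "train": paper_ids[n_test + n_dev:],
    }
    return {
        name: [s for pid in ids for s in by_paper[pid]]
        for name, ids in groups.items()
    }


toy = [{"paper_id": f"p{i}", "pair": j} for i in range(10) for j in range(3)]
splits = split_by_paper(toy, random.Random(0))
```

Shuffling paper identifiers instead of individual pairs is what prevents sentences from the same paper leaking between the training and evaluation sets.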
Three annotators annotated the MSciNLI test and development sets. A random, class-balanced subset of sentence pairs from the test and development sets is given to the annotators, with instructions to label the relationship between the sentences based only on the context available in the two sentences of each sample. If an annotator cannot determine the label from the two sentences, they are instructed to mark the pair as unclear. Each sample is assigned a gold label by majority vote among the annotators. If the annotators cannot agree (about 3% of cases), no gold label is assigned. Samples whose gold label matches the label automatically assigned from the linking phrase are included in the respective split; all others are excluded.
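The majority-vote rule and the filtering against the automatic label can be expressed compactly. The function names below are hypothetical, and the sketch assumes each sample carries exactly the list of labels given by its annotators.

```python
from collections import Counter


def gold_label(votes):
    """Return the label chosen by a strict majority of annotators,
    or None when no label reaches a majority."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def keep_sample(votes, auto_label):
    """Keep a sample only if a gold label exists and it matches the label
    assigned automatically from the linking phrase."""
    gold = gold_label(votes)
    return gold is not None and gold == auto_label


votes = ["REASONING", "REASONING", "NEUTRAL"]
```

With three annotators, a strict majority means at least two identical votes; three different votes yield no gold label, and a majority of "unclear" votes never matches an automatic label, so such samples are dropped.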
For each domain, samples were drawn at random (without replacement) and manually annotated until the test set contained at least 800 clean samples (200 from each class) and the development set contained 200 clean samples (50 from each class). In total, 6,992 samples were annotated, and for 6,153 of them the gold label matched the automatically assigned label. In other words, the overall agreement rate for MSciNLI is 88.0%.
To balance the data, the number of samples for each class in each domain was downsampled to 200 for the test set and 50 for the development set. As a result, the test set contains 4,000 samples and the development set contains 1,000 samples. The training set was balanced across classes using the same procedure.
Next are the statistics for MSciNLI. The table below compares the statistics of MSciNLI with those of SciNLI and shows that the total number of samples (<premise, hypothesis> pairs) in MSciNLI is larger than in SciNLI, the only other natural language inference dataset for scientific papers. Furthermore, each domain of MSciNLI contributes a large number of samples to the training set.
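As a quick sanity check, the agreement rate and the evaluation set sizes follow directly from the counts reported in this section (five domains, four classes, 200 test and 50 development samples per class and domain). The variable names are only for illustration.

```python
# Counts reported in the article.
annotated = 6992
matched = 6153
agreement = matched / annotated
print(f"agreement: {agreement:.1%}")

domains = 5
classes = 4
test_size = domains * classes * 200
dev_size = domains * classes * 50
print(test_size, dev_size)
```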

As with SciNLI, the Stanford PCFG Parser (3.5.2) is used to parse the sentences in the dataset. As shown in the table above, approximately 94% of the sentences in MSciNLI have an "S" root, indicating that most sentences in the dataset are syntactically complete. The table also shows that the word overlap rate between premise and hypothesis in each MSciNLI pair is low and close to that of SciNLI. Thus, like SciNLI, the MSciNLI dataset is not easily exploited through superficial lexical cues.
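The article does not give the exact formula behind the word overlap rate. As one plausible reading, the sketch below assumes a Jaccard-style ratio over lowercased whitespace tokens; both the definition and the function name are assumptions, not the paper's stated metric.

```python
def word_overlap(premise, hypothesis):
    """Assumed overlap measure: shared tokens divided by all distinct
    tokens across the two sentences (Jaccard similarity)."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    if not p or not h:
        return 0.0
    return len(p & h) / len(p | h)


overlap = word_overlap("the cache is small", "the cache misses often")
```

A low value of such a measure means a model cannot rely on simply counting shared words to guess the relation.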
Evaluation of MSciNLI
The MSciNLI evaluation consists of three stages. First, we evaluate difficulty using a BiLSTM model. Second, we build baselines using four pre-trained language models and two large language models and compare their performance with that of humans. Third, we examine performance when fine-tuning on various subsets of the training set and under domain shift, analyzing the best baseline.
The results of the first stage, the difficulty assessment using the BiLSTM model, are shown in the table below, which compares the performance of this model on MSciNLI and SciNLI. MSciNLI is a more challenging dataset than SciNLI: the BiLSTM model's Macro F1 score is 61.12% on SciNLI but only 54.40% on MSciNLI. These results indicate that MSciNLI poses a broader challenge for the model than SciNLI, making the scientific natural language inference task more difficult.

In the second stage, pre-trained language models and large language models are used to establish baselines. The base variants of four pre-trained language models, BERT (Devlin et al., 2019), SciBERT (Beltagy et al., 2019), RoBERTa (Liu et al., 2019b), and XLNet (Yang et al., 2019), are fine-tuned on the combined MSciNLI training set. Each experiment was run three times with different random seeds, and the mean and standard deviation of the Macro F1 scores are reported by domain and overall. The results are shown in the table below.
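For readers unfamiliar with the metric, Macro F1 is the unweighted mean of the per-class F1 scores, and the reported numbers aggregate it over three seeds. The sketch below uses only the standard library; the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the article does not say which variant is reported.

```python
from statistics import mean, stdev


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    scores = []
    for cls in sorted(set(gold) | set(pred)):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def summarize(seed_scores):
    """Mean and (sample) standard deviation across runs with different seeds."""
    return mean(seed_scores), stdev(seed_scores)


toy_score = macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

Because every class contributes equally, Macro F1 suits the class-balanced test set described earlier.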

SciBERT outperforms BERT in all domains. SciBERT is trained with the same procedure as BERT but is pre-trained on scientific papers, which may help improve performance on scientific natural language inference. In addition, RoBERTa and XLNet are designed to address the weaknesses of BERT, and both perform significantly better than BERT in all domains. In particular, RoBERTa consistently outperforms XLNet and also outperforms SciBERT.
Next, we evaluate two large language models as baselines: LLAMA-2 (Touvron et al., 2023) and MISTRAL (Jiang et al., 2023). Specifically, we use Llama-2-13b-chat-hf with 13 billion parameters and Mistral-7B-Instruct-v0.1 with 7 billion parameters.
The paper provides three multiple-choice question templates (shown below) for the scientific natural language inference task.
- PROMPT-1: Given a pair of sentences, have the large language model predict the class, with the four class names as choices.
- PROMPT-2: Provide the large language model with further context about the scientific natural language inference task, define each class, and then have the model predict the class, with the class names as choices.
- PROMPT-3: Use the class definitions directly as the choices.
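To make the PROMPT-1 idea concrete, the sketch below assembles a multiple-choice query with the four class names as choices and optionally prepends few-shot examples. The wording of the template is invented for illustration and is not the template used in the paper; `build_prompt` is a hypothetical helper.

```python
CLASSES = ["ENTAILMENT", "REASONING", "CONTRASTING", "NEUTRAL"]


def build_prompt(premise, hypothesis, examples=()):
    """Build a PROMPT-1 style query (placeholder wording).

    `examples` holds (premise, hypothesis, label) triples; pass four of
    them for a four-shot prompt or none for a zero-shot prompt.
    """
    choices = ", ".join(CLASSES)
    lines = ["Choose the relation between the two sentences."]
    for ex_premise, ex_hypothesis, ex_label in examples:
        lines += [
            f"Sentence 1: {ex_premise}",
            f"Sentence 2: {ex_hypothesis}",
            f"Choices: {choices}",
            f"Answer: {ex_label}",
            "",
        ]
    lines += [
        f"Sentence 1: {premise}",
        f"Sentence 2: {hypothesis}",
        f"Choices: {choices}",
        "Answer:",
    ]
    return "\n".join(lines)


zero_shot = build_prompt("A.", "B.")
four_shot = build_prompt("A.", "B.", examples=[("x", "y", "NEUTRAL")] * 4)
```

A PROMPT-3 style variant would replace the class names in `choices` with the class definitions; those definitions are not reproduced in this article, so they are left out here.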

The paper also evaluates the performance of the large language models in two settings: zero-shot and four-shot. The domain-specific and overall Macro F1 scores for each experiment are shown in the table below. Note that the zero-shot and four-shot results for each prompt are denoted by the suffixes "zs" and "fs," respectively (for example, PROMPT-3fs).

Results show that LLAMA-2 achieves the highest performance with PROMPT-3fs, reaching a Macro F1 of 51.77%. This is 6.28 points higher than the best MISTRAL result, obtained with PROMPT-1fs.
We also evaluate human performance on MSciNLI for three experts (with relevant domain background; E) and three non-experts (without domain background; NE). Human performance is estimated by re-annotating a small, randomly sampled subset of the test set, and the mean and standard deviation of Macro F1 are computed for experts and non-experts. The results are compared with RoBERTa, the best pre-trained language model baseline, and with LLAMA-2 using PROMPT-3fs, the best large language model baseline.

Results show that expert annotators significantly outperform non-experts. The performance of the non-experts, while lower than that of the experts, still exceeds the baselines. The performance of the experts is also significantly higher than that of both RoBERTa and LLAMA-2. This indicates that there is substantial room for improving model performance.
Through these evaluations, we can see that MSciNLI is an important dataset for the scientific natural language inference task and how its difficulty and diversity affect model performance.
Analysis of MSciNLI
We analyze the MSciNLI training set from several perspectives. First, using data cartography (Swayamdipta et al., 2020), we evaluate the training set by fine-tuning the model on different training subsets selected by data cartography. Next, we investigate the behavior of the models under domain shift at test time. Finally, we conduct cross-dataset experiments comparing models fine-tuned on SciNLI, on MSciNLI, and on a combination of the two. All of these experiments use the best baseline model, RoBERTa.
Data cartography characterizes each sample in the MSciNLI training set by two measures: confidence and variability. Based on this characterization, three different RoBERTa models are fine-tuned, each on one of the following subsets of the training set.
- 33% - easy-to-learn - samples with high confidence
- 33% - hard-to-learn - samples with low confidence
- 33% - ambiguous - samples with high variability
In addition, to analyze the impact of hard-to-learn samples on model training, we fine-tune the model on the full training set with each of the following two subsets removed.
- 100% - top 25% hard (the full training set minus the 25% of samples with the lowest confidence)
- 100% - top 5% hard (the full training set minus the 5% of samples with the lowest confidence)
These results are shown in the table below. Among the three 33% subsets, the model fine-tuned on the ambiguous samples (33% - ambiguous) performs best. This suggests that ambiguous training samples are useful for training strong scientific natural language inference models.
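The subset selection can be sketched as follows. In data cartography (Swayamdipta et al., 2020), confidence is the mean probability the model assigns to the gold label across training epochs, and variability is the standard deviation of that probability. The function names and the input format are assumptions for illustration.

```python
from statistics import mean, pstdev


def cartography_stats(gold_probs):
    """Map each sample id to (confidence, variability).

    gold_probs: dict of sample id -> list of gold-label probabilities,
    one per training epoch (hypothetical input format).
    """
    return {sid: (mean(p), pstdev(p)) for sid, p in gold_probs.items()}


def select_subsets(stats, frac=0.33):
    """Pick the easy-to-learn, hard-to-learn, and ambiguous subsets."""
    k = max(1, int(len(stats) * frac))
    by_conf = sorted(stats, key=lambda sid: stats[sid][0])
    by_var = sorted(stats, key=lambda sid: stats[sid][1], reverse=True)
    return {
        "hard-to-learn": by_conf[:k],   # lowest confidence
        "easy-to-learn": by_conf[-k:],  # highest confidence
        "ambiguous": by_var[:k],        # highest variability
    }


def drop_hardest(stats, frac):
    """Return the ids left after removing the lowest-confidence fraction."""
    k = int(len(stats) * frac)
    return sorted(stats, key=lambda sid: stats[sid][0])[k:]


toy_probs = {"a": [0.9, 0.95, 1.0], "b": [0.1, 0.1, 0.1], "c": [0.2, 0.8, 0.5]}
toy_stats = cartography_stats(toy_probs)
subsets = select_subsets(toy_stats, frac=0.34)
```

In the toy data, sample "b" is consistently wrong (hard-to-learn), "a" is consistently right (easy-to-learn), and "c" fluctuates between epochs (ambiguous).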

Although the 33% ambiguous subset (33% - ambiguous) performs well, the entire training set (100%) performs better. Also, removing some of the hard-to-learn samples (25% or 5%) does not produce a statistically significant difference in overall performance. In other words, all samples in the training set contribute to learning the best model.
We further investigate the impact of domain shift by training RoBERTa on one domain and testing it on another (out-of-domain). In addition to the five domains of MSciNLI, the experiments also include the ACL domain of SciNLI. For a fair comparison, the SciNLI training set is downsampled to the same size as the other domains and denoted ACL-SMALL. In-domain (ID) and out-of-domain (OOD) results are shown in the table below.

Models tested in-domain (ID) perform better than models tested out-of-domain (OOD). For example, a model fine-tuned on the NeurIPS training set achieves a Macro F1 of 76.02% when tested on NeurIPS, while models trained on other domains perform worse when tested on NeurIPS. This indicates that each domain's sentence pairs have unique linguistic properties that are better captured by models trained on data from the same domain.
The final experiment is a cross-dataset experiment. The following four RoBERTa models were trained and evaluated on each test set.
- SciNLI
- MSciNLI
- MSciNLI+(S) - Combination of MSciNLI and ACL-SMALL
- MSciNLI+ - Combination of MSciNLI and SciNLI
These results are shown in the table below. Under dataset shift, the performance of both the SciNLI and the MSciNLI models degrades. However, models fine-tuned on MSciNLI maintain relatively high performance in the out-of-dataset setting: the model fine-tuned on SciNLI performs 2.02% worse when tested on MSciNLI, while the model fine-tuned on MSciNLI drops only 1.34% when tested on SciNLI. This shows that data diversity helps in training models that generalize well.

Models fine-tuned on MSciNLI+ also perform best on both datasets and on their combination. Fine-tuning the model on a larger training set with diverse samples yields better performance. The model trained on MSciNLI+(S) performs worse than the model trained on MSciNLI+, but still outperforms the model trained on MSciNLI alone. This indicates that the benefit of combining the datasets also holds for MSciNLI+(S).
Summary
This paper introduces MSciNLI, a diverse benchmark for scientific natural language inference derived from five scientific domains. MSciNLI was found to be more difficult to classify than the only other relevant dataset, SciNLI. The authors construct strong baselines for MSciNLI and verify that the dataset is challenging for both pre-trained language models (PLMs) and large language models (LLMs). In addition, the paper comprehensively investigates the performance of scientific natural language inference models under domain shift at test time and their use in downstream natural language processing tasks.
Experimental results show that large language models perform poorly on MSciNLI (the highest Macro F1 score is 51.77%), indicating that there is much room for further improvement. In addition, the prompt design has a significant impact on performance, and further exploration of other prompting strategies could lead to improved performance. 
The authors intend to focus future efforts on prompt design to improve the performance of large-scale language models in scientific natural language reasoning.