Towards Automating Scientific Paper Reviewing?

Natural Language Processing 05/03/2021

3 main points
✔️ An ambitious attempt to automatically generate reviews for scientific papers.
✔️ A new dataset with a collection of 20000+ reviews of various scientific papers: ASAP-Review.
✔️ An open-source system to automatically generate reviews of research papers.

Can We Automate Scientific Reviewing?
written by Weizhe Yuan, Pengfei Liu, Graham Neubig
(Submitted on 30 Jan 2021)
Comments: TLDR: This paper proposes to use NLP models to generate first-pass peer reviews for scientific papers.
Subjects: Computation and Language (cs.CL)

Introduction

Countless scientific papers are published every day, and the pace is even faster in the Artificial Intelligence community. It is extremely difficult to find the papers relevant to your interests in this pile, and the sheer volume also slows down peer review, which is essential for validating the ideas presented in any paper. This poses a real challenge for a rapidly advancing scientific community.

"This paper proposes to use NLP models to generate reviews for scientific papers. The model is trained on the ASAP-Review dataset and evaluated on a set of metrics to evaluate the quality of the generated reviews. It is found that the model is not very good at summarizing the paper, but it is able to generate more detailed reviews that cover more aspects of the paper than those created by humans. The paper also finds that both human and automatic reviewers exhibit varying degrees of bias and biases and that the system generates more biased reviews than human reviewers."

The entire quoted paragraph above is the review that the system generated for this very paper. Surprised? Read on to learn more about the system.

What is a GOOD review?

A good review exhibits several objective qualities (e.g. it uses factually correct information from the paper) and subjective qualities (e.g. an unbiased interpretation). This makes it difficult to define what a good review is. The paper uses four major criteria to quantify a good review: Decisiveness, Comprehensiveness, Justification, and Accuracy.

Our goal is to evaluate the quality of a review R (written by a human or generated automatically) of a paper D using its meta-review R^m (a summary of the actual reviews of the paper). Two functions are defined for this. DEC(D) ∈ {1, -1}, meaning {'accept', 'reject'}, is the final decision recorded in the meta-review. REC(R) ∈ {1, 0, -1}, meaning {'accept', 'neutral', 'reject'}, is the recommendation expressed by the review R.

Let us look at each of the criteria and how they are evaluated in detail.

1) Decisiveness

A good peer review takes a clear stance: it recommends only worthy papers and rejects the others. The degree of decisiveness is measured with Recommendation Accuracy (RAcc), which checks whether the review's recommendation (REC) is consistent with the actual decision made for the paper (DEC).

RAcc(R) = DEC(D) × REC(R)

2) Comprehensiveness

Good reviews should be well organized, with a short summary and an evaluation of different aspects of the paper. Two metrics are used to measure comprehensiveness: Aspect Coverage (ACov) and Aspect Recall (ARec). For a review R, ACov measures how many of the predefined aspects it covers. The aspects are predefined as Summary (SUM), Motivation/Impact (MOT), Originality (ORI), etc., and are discussed later. ARec counts how many of the aspects mentioned in the meta-review are also covered by the review.
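A minimal sketch of the two metrics, under stated assumptions: ACov is measured against the eight predefined aspect labels listed in the dataset section, ARec is measured against the aspects found in the meta-review, and both are normalized to fractions (the normalization and the function names are assumptions made for illustration).

```python
# The eight predefined aspect labels used in the article.
ASPECTS = {"SUM", "MOT", "ORI", "SOU", "SUB", "REP", "CMP", "CLA"}


def acov(review_aspects: set) -> float:
    """Fraction of the predefined aspects that the review touches."""
    return len(review_aspects & ASPECTS) / len(ASPECTS)


def arec(review_aspects: set, meta_aspects: set) -> float:
    """Fraction of the meta-review's aspects that the review also covers."""
    if not meta_aspects:
        return 1.0  # assumption: nothing to recall counts as full recall
    return len(review_aspects & meta_aspects) / len(meta_aspects)


review = {"SUM", "ORI", "CLA", "SOU"}
meta_review = {"ORI", "SOU", "CMP"}

print(acov(review))               # 0.5 (4 of 8 aspects)
print(arec(review, meta_review))  # 2 of 3 meta-review aspects
```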

3) Justification

The evaluation of the paper must be constructive and backed by proper evidence and reasons. Justification is measured with the Info(R) metric: the ratio of the number of aspects in R with negative sentiment that are supported by evidence (n_nae) to the total number of aspects in R with negative sentiment (n_na), i.e. Info(R) = n_nae / n_na. Whether evidence is present is judged manually, and Info(R) is set to 1 when n_na = 0.
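As a small worked example, assuming the ratio is taken as n_nae / n_na (supported negative aspects over all negative aspects), the metric could be computed as follows. The function name `info` and the input validation are illustrative choices, not taken from the paper.

```python
def info(n_na: int, n_nae: int) -> float:
    """Info(R): share of negative aspects that come with supporting evidence.

    n_na  -- number of aspects in the review with negative sentiment
    n_nae -- number of those aspects that are backed by evidence
    """
    if n_na < 0 or n_nae < 0 or n_nae > n_na:
        raise ValueError("expected 0 <= n_nae <= n_na")
    if n_na == 0:
        return 1.0  # no criticism at all: defined as 1 in the article
    return n_nae / n_na


print(info(4, 3))  # 0.75
print(info(0, 0))  # 1.0
```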

4) Accuracy

The information used must be factually correct. The Summary Accuracy (SAcc) metric expresses how well the review summarizes the paper and takes the values {0, 0.5, 1} for {incorrect, partially correct, correct}. These values are assigned manually by humans. Another metric, Aspect-level Constructiveness (ACon), evaluates the quality of the evidence provided for the negative-sentiment aspects (n_na) in a review. So, simply providing evidence and getting a higher Info(R) score is not enough: the evidence needs to be accurate and appropriate to get a higher overall score.

5) Semantic Equivalence

In addition to the four criteria above, two more metrics are introduced to measure the semantic equivalence of the paper and the review. A greater semantic equivalence means that the content of the paper has been accurately represented in the review. ROUGE (word matching) and BERTScore (similarity of contextual word embeddings) are calculated and the maximum of the two values is taken.

The Dataset

The ASAP-Review Dataset

In order to train the model, a new dataset was created from ICLR papers (2017-2020) and NeurIPS papers (2016-2019). The metadata for each paper includes reference reviews (from committee members), meta-reviews (written by senior committee members), accept/reject decisions, and other information such as URL, authors, comments, and subject.

Human and Automatic Aspect Labels for Positive and Negative Sentiments.

Each review in the dataset is annotated with aspects, i.e. predefined labels: Summary (SUM), Motivation/Impact (MOT), Originality (ORI), Soundness/Correctness (SOU), Substance (SUB), Replicability (REP), Meaningful Comparison (CMP), and Clarity (CLA). First, 1000 reviews were annotated by hand. Then, a BERT model was fine-tuned on these 1000 reviews and used to annotate the remaining 20000+ reviews. Finally, 300 reviews were randomly sampled and their automatic annotations were checked by humans. The results are shown below.

The low (50%) recall for Replicability with positive sentiment can be attributed to the small number of examples. Apart from that, the other values are satisfactorily high.
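To make the precision and recall figures concrete, here is a minimal sketch of how per-label scores could be computed when automatic labels are compared with human-checked ones. The label strings (e.g. "REP+") and the three toy reviews are invented purely for illustration and are not data from the paper.

```python
from collections import Counter


def precision_recall(gold, pred):
    """Per-label (precision, recall) given one set of labels per review.

    gold -- list of human-checked label sets
    pred -- list of automatically assigned label sets (same order)
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in p & g:
            tp[label] += 1   # predicted and correct
        for label in p - g:
            fp[label] += 1   # predicted but wrong
        for label in g - p:
            fn[label] += 1   # missed by the model
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else None
        recall = tp[label] / n_gold if n_gold else None
        scores[label] = (precision, recall)
    return scores


# Toy example: "+" / "-" mark positive / negative sentiment (made-up data).
gold = [{"CLA+", "REP+"}, {"REP+", "SOU-"}, {"CLA+"}]
pred = [{"CLA+", "REP+"}, {"SOU-"}, {"CLA+", "SOU-"}]

scores = precision_recall(gold, pred)
print(scores["REP+"])  # (1.0, 0.5)
```

With only two gold "REP+" examples, a single miss already halves the recall, which is the kind of effect a rare label can produce.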

Training for Scientific Review Generation

The authors use a pre-trained BART model to generate scientific reviews. BART accepts a maximum input length of 1024 tokens, which is too short for most scientific papers. So, after testing various methods, a two-stage approach was adopted. In the first stage, important information is extracted from the paper using the Oracle and Cross-Entropy (CE) extraction methods. In the second stage, the extracted text is passed to the model, which generates the review.
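The extract-then-generate idea can be sketched as follows. This is not the Oracle or CE extraction from the paper: the keyword filter is a hypothetical stand-in for the extraction step, and whitespace-separated words stand in for BART's subword tokens. The point is only to show how a long paper is reduced to fit a fixed input budget.

```python
def extract_within_budget(sentences, keywords, max_tokens=1024):
    """Keep keyword-bearing sentences, in document order, until the budget is hit.

    sentences  -- list of sentence strings from the paper
    keywords   -- set of lowercase cue words (hypothetical selection rule)
    max_tokens -- input budget; whitespace tokens approximate model tokens
    """
    selected, used = [], 0
    for sentence in sentences:
        tokens = sentence.split()
        # Skip sentences that contain none of the cue words.
        if not any(t.lower().strip(".,;:()") in keywords for t in tokens):
            continue
        # Stop as soon as the next selected sentence would exceed the budget.
        if used + len(tokens) > max_tokens:
            break
        selected.append(sentence)
        used += len(tokens)
    return " ".join(selected)


paper = [
    "We propose a new model.",
    "The weather was nice.",
    "Experiments show strong results.",
    "Future work remains.",
]
cues = {"propose", "experiments", "results"}

print(extract_within_budget(paper, cues, max_tokens=10))
# We propose a new model. Experiments show strong results.
```

The returned string would then be fed to the generation model as the second stage.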

In addition, the authors make use of the aspect annotations in the ASAP-Review dataset and formulate an auxiliary classification problem that predicts the aspect labels. So, the loss function is given by

$$\mathcal{L}_{net} = \mathcal{L}_{seq2seq} + k\,\mathcal{L}_{seqlab}$$

Here, $k$ (= 0.1) is a hyperparameter, tuned during development, that makes the generated reviews more aspect-aware. $\mathcal{L}_{seq2seq}$ is the negative log-likelihood loss for predicting the next word, and $\mathcal{L}_{seqlab}$ is the negative log-likelihood loss for predicting the aspect label of the next word.
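A minimal numeric sketch of this weighted sum, assuming both terms are mean negative log-likelihoods over the sequence (averaging rather than summing is an assumption, and the function names are hypothetical). The probabilities are toy values, not model outputs.

```python
import math


def nll(probs, targets):
    """Mean negative log-likelihood of the target index at each step.

    probs   -- list of probability distributions, one per step
    targets -- list of target indices, one per step
    """
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def combined_loss(word_probs, word_targets, label_probs, label_targets, k=0.1):
    """Next-word loss plus k times the aspect-label loss."""
    return nll(word_probs, word_targets) + k * nll(label_probs, label_targets)


# Toy two-step sequence: distributions over a 2-word vocabulary and 2 labels.
word_probs = [[0.5, 0.5], [0.25, 0.75]]
word_targets = [0, 1]
label_probs = [[1.0, 0.0], [0.5, 0.5]]
label_targets = [0, 1]

loss = combined_loss(word_probs, word_targets, label_probs, label_targets)
print(round(loss, 4))
```

With k = 0.1, the label term contributes only a small correction, so generation quality remains the dominant objective.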

Evaluation

† denotes cases when the difference between human and model performance is statistically significant.

The above table compares the model's performance with human performance. The model generates highly comprehensive reviews, covering more aspects than human reviewers. It is also really good at summarizing the paper.

However, as one might expect, the model does not question the contents of the paper like a human reviewer would, and it tends to imitate frequent patterns in the training set ("this paper is well-written and easy to follow" was repeated 90% of the time). Because a single paper provides insufficient context, the model also fails to distinguish between good and bad quality papers.

Review generation results for the paper "Deep Residual Learning for Image Recognition"

Conclusion

The model introduced in the paper is already usable for several tasks, even though it is not yet ready for professional use. For instance, it could help young and inexperienced researchers who are just learning about the scientific reviewing process. The ASAP-Review dataset contains reviews exclusively from the machine learning domain and needs to be expanded to cover other fields accurately. Although there is a long way to go before reaching human-level performance, this paper has succeeded in establishing a strong foundation for future work. For more details on the model and dataset, please refer to the original paper. Get your paper reviewed automatically.

Thapa Samrat: I am a second year international student from Nepal who is currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning. So I write articles about them in my spare time.
