Can Large-scale Language Models Replace Humans In Text Evaluation Tasks?

3 main points
✔️ Examine the usefulness of large-scale language models for evaluating text quality
✔️ Large-scale language models can evaluate text quality about as well as human experts and are highly reproducible and fast, but they have weaknesses such as misinterpreting facts and lacking emotions.
✔️ Large-scale language models do not completely replace human evaluations and may be most effective when used in combination.

Can Large Language Models Be an Alternative to Human Evaluations?
written by Cheng-Han Chiang, Hung-yi Lee
(Submitted on 3 May 2023)
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: ACL 2023 main conference paper


The images used in this article are from the paper, the introductory slides, or were created based on them.


An important way to evaluate the performance of natural language processing models and algorithms is through "human evaluation". Due to the nature of natural language, some aspects are difficult to capture with automatic evaluation metrics. For example, the meaning and value of creative texts such as poems and stories, or of texts containing irony or metaphors, are difficult to assess algorithmically. In such cases, human evaluation is necessary to assess the quality of the output of a natural language processing model, and this approach is commonly used in the field.

However, human evaluation also has its problems. For example, different people may rate the same text differently due to a number of factors, including the subjectivity of individual raters or differences in their interpretation of the evaluation criteria. As a result, reproducibility cannot be guaranteed.

Therefore, the paper presented here proposes the use of "large-scale language models" as a new evaluation method to address this reproducibility issue. Large-scale language models are models trained to model human language. They are trained on large amounts of textual data accessible on the Web, and as a result, they learn how people use language. In other words, evaluation by a large-scale language model that captures these human characteristics is expected to align well with evaluation by humans, while having the potential to achieve higher reproducibility and stability.

The paper presented here examines whether "evaluation by large-scale language models" can replace "human evaluation" in two tasks.

Task 1: Open-ended story generation

To confirm the usefulness of evaluation with a large-scale language model, the paper first tests it on a task called "open-ended story generation", which generates short stories based on given prompts. In this task, stories written by humans and stories generated by a generative model (GPT-2) are rated by both large-scale language models and humans, to verify whether the large-scale language models rate the human-written stories higher than the generated ones.

This setup follows a previous study in which Amazon Mechanical Turk workers could not distinguish between GPT-2-generated and human-written stories, while English teachers rated the human-written stories as better than those generated by GPT-2.

Note that this task uses the "WritingPrompts" dataset (Fan et al., 2018). WritingPrompts is one of the subreddits (topic-specific discussion boards) on the popular online community site Reddit, where users post short prompts and other users write short stories or essays based on them. The WritingPrompts dataset pairs these prompts with the stories written in response to them.

The evaluation is carried out as shown in the figure below. First, a questionnaire is prepared, consisting of evaluation instructions, a story fragment, and evaluation questions. Each story is rated on a 5-point Likert scale for four attributes: grammaticality, cohesiveness, likability, and relevance. For the human evaluation, the evaluator answers the prepared questionnaire as is. For the evaluation by a large-scale language model, the questionnaire is given to the model as a prompt, and the model's output is taken as its rating.
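As a rough illustration of this procedure, the sketch below assembles a questionnaire into a prompt and extracts a 1-5 rating from a model's free-text answer. The question wording, the function names `build_questionnaire` and `parse_likert`, and the parsing rule are all assumptions made for illustration; they are not the paper's actual instructions or code, and the call to the language model itself is left out.

```python
import re
from typing import Optional

# Illustrative wording only; the paper's exact questions are not reproduced here.
QUESTIONS = {
    "grammaticality": "How grammatically correct is the text of the story fragment?",
    "cohesiveness": "How well do the sentences in the story fragment fit together?",
    "likability": "How enjoyable do you find the story fragment?",
    "relevance": "How relevant is the story fragment to the prompt?",
}


def build_questionnaire(prompt: str, story: str, attribute: str) -> str:
    """Combine instructions, the sample, and one rating question into a single prompt."""
    question = QUESTIONS[attribute]
    return (
        "Please rate the story fragment below.\n\n"
        f"Prompt: {prompt}\n"
        f"Story fragment: {story}\n\n"
        f"{question} Answer on a scale of 1 (lowest) to 5 (highest)."
    )


def parse_likert(response: str) -> Optional[int]:
    """Return the first standalone digit from 1 to 5 in the response, or None."""
    match = re.search(r"\b([1-5])\b", response)
    return int(match.group(1)) if match else None
```

The same questionnaire text can be shown to a human rater unchanged, which is what makes the two kinds of ratings comparable. A response with no usable rating yields `None`, so such cases can be counted separately rather than silently scored.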

Four large-scale language models are used: T0, text-curie-001, text-davinci-003, and ChatGPT. Both text-curie-001 and text-davinci-003 are InstructGPT models, with text-davinci-003 being the stronger model. Because prior research suggests that crowdsourced evaluation on Amazon Mechanical Turk can be unreliable, the human evaluation did not use it; instead, three English teachers were recruited through the freelancer platform UpWork. The large-scale language models and the English teachers each evaluated 200 human-written stories and 200 stories generated by GPT-2.

Open-ended story generation validation results

The results of the validation are shown in the table below. Human ratings (by English teachers) indicate a preference for human-written stories. English teachers rated human-written stories higher than GPT-2-generated stories on all four attributes (Grammaticality, Cohesiveness, Likability, and Relevance). This indicates that English teachers (experts) can distinguish the difference in quality between stories written by the generative model and those written by humans.

In contrast, T0 and text-curie-001 show no clear preference for human-written stories. These models do not significantly distinguish the quality of human-written stories from that of stories written by the generative model, which indicates that they are not as competent as human experts in evaluating open-ended story generation. On the other hand, text-davinci-003 shows a clear preference for human-written stories, just as the English teachers do: it rates human-written stories higher than generated stories on all attributes, and the difference is statistically significant. ChatGPT also rates human-written stories higher, again with statistical significance, and it can additionally provide detailed reasons for its ratings. IAA in the table denotes inter-annotator agreement.
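To make "statistically significant" concrete, the sketch below computes a two-sample test statistic on two sets of Likert ratings. This article does not say which test was used, so Welch's t-test is assumed here as one common choice for samples with unequal variances. The ratings are made-up toy numbers, not results from the paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)  # squared standard error of sample b
    return (mean(a) - mean(b)) / sqrt(va + vb)


def welch_df(a, b):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    return (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))


# Toy ratings, invented purely for illustration.
human_written = [4, 5, 4, 3, 5, 4]
gpt2_generated = [3, 3, 4, 2, 3, 3]

t_stat = welch_t(human_written, gpt2_generated)
dof = welch_df(human_written, gpt2_generated)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

A large positive statistic means the first group was rated higher on average relative to the spread of the ratings. Turning it into a p-value requires the t distribution, for example via a statistics library.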

Task 2: Adversarial Attack

The second task evaluates sentences produced by adversarial attacks on a text classifier. Specifically, starting from a sentence that the classifier handles correctly (e.g., it correctly identifies whether the sentence has a positive or negative meaning), an adversarial attack slightly alters the sentence (e.g., by substituting synonyms) so that the classifier's prediction changes. The quality of the resulting adversarial sentences is then rated by a large-scale language model (in this case, ChatGPT) and by humans, and the results are compared. The sentences are rated on whether they are natural and fluent (Fluent) and whether the original meaning of the sentence is preserved (Mean.). Textfooler, PWWS, and BAE are used as the adversarial attack methods, and the attacked model is a BERT-base-uncased model trained to classify news article titles.
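The idea of a synonym-substitution attack can be sketched with a toy example. This is not a reimplementation of Textfooler, PWWS, or BAE, which choose words and substitutes far more carefully. The keyword classifier `toy_classifier`, the synonym table, and the function `synonym_attack` are hypothetical stand-ins, and the sentiment example mirrors the illustration above rather than the news-title classifier used in the paper.

```python
from typing import Callable, Dict, List, Optional


def toy_classifier(sentence: str) -> str:
    """Stand-in keyword classifier; the real target is a fine-tuned BERT model."""
    positive_words = {"great", "good"}
    words = sentence.lower().split()
    return "positive" if any(w in positive_words for w in words) else "negative"


def synonym_attack(
    sentence: str,
    classify: Callable[[str], str],
    synonyms: Dict[str, List[str]],
) -> Optional[str]:
    """Try single-word synonym swaps; return the first one that flips the label."""
    original_label = classify(sentence)
    words = sentence.split()
    for i, word in enumerate(words):
        for candidate in synonyms.get(word.lower(), []):
            trial = " ".join(words[:i] + [candidate] + words[i + 1:])
            if classify(trial) != original_label:
                return trial  # adversarial sentence found
    return None  # no single swap changed the prediction


# Hypothetical synonym table for the toy example.
toy_synonyms = {"great": ["terrific"]}
adversarial = synonym_attack("The movie was great", toy_classifier, toy_synonyms)
print(adversarial)
```

Here the swap keeps the meaning for a human reader but changes the toy classifier's prediction. Real attacks often produce less natural substitutions, which is why fluency and meaning preservation are the two properties being rated.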

Adversarial attack validation results

The validation results are shown in the table below, where Benign denotes the original, unattacked sentences and Textfooler, PWWS, and BAE denote sentences produced by the respective adversarial attacks. The English teachers (Human evaluate) rate the sentences produced by adversarial attacks lower than the original sentences in terms of both fluency (Fluent) and preservation of meaning (Mean.). This is consistent with recent studies reporting that adversarial attacks produce lower-quality sentences.

For the large-scale language model evaluation (LLM evaluate), the first step is to verify that the model understands the task. To do so, the model is asked to rate the preservation of meaning (Mean.) between two exactly identical sentences; ideally, it should always give a score of 5 (completely agree). The result of this check is 5.00, indicating that ChatGPT understands the task.

Looking next at how the large-scale language model rates adversarial sentences, ChatGPT tends to give them higher ratings than the English teachers do. Even so, ChatGPT also rates adversarial sentences lower than the original sentences. Overall, the large-scale language model ranks the quality of adversarial sentences relative to original sentences in the same way as humans.


The paper proposes the use of large-scale language models as an alternative to human evaluation of text quality and validates their usefulness in two tasks: "open-ended story generation" and "adversarial attack". Based on the validation results, the paper identifies the following four advantages of evaluation by large-scale language models.

  1. Reproducibility: In human evaluation, ratings vary from evaluator to evaluator, but evaluation by a large-scale language model can be made more reproducible by specifying the model, random seed, and hyperparameters.
  2. Independence: In human evaluation, the next sample's evaluation may be influenced by the previously viewed sample, but in evaluation by a large-scale language model, each evaluation is independent and is not influenced by the previous sample.
  3. Cost efficiency and speed: Evaluation by large-scale language models is less costly and faster than human evaluation.
  4. Reduced exposure to objectionable content: Evaluation by large-scale language models spares human evaluators the discomfort of rating inappropriate content.
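The reproducibility point can be illustrated by recording every setting that affects a rating alongside the results. The sketch below is one possible way to do this. The `EvalConfig` class, its field names, and the example values are assumptions for illustration and do not come from the paper.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical record of the settings that determine an LLM evaluation run."""
    model: str
    temperature: float
    top_p: float
    seed: int

    def fingerprint(self) -> str:
        """Short, stable hash of the settings, usable as a label for stored ratings."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Example values are placeholders, not the settings used in the paper.
config = EvalConfig(model="text-davinci-003", temperature=1.0, top_p=0.9, seed=0)
print(config.fingerprint())
```

Two runs that share a fingerprint used identical settings, so differences between their ratings cannot be attributed to a changed model or hyperparameter.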

On the other hand, there are limitations and ethical issues associated with evaluation by large-scale language models. Large-scale language models are generally prone to misinterpreting facts, and their training can introduce biases. Furthermore, they cannot interpret visual cues, and thus cannot interpret tasks in exactly the same way as humans. They may also lack emotions, which may reduce their usefulness in evaluating emotion-related tasks. Human ratings and ratings from large-scale language models each have their advantages and disadvantages, and they are likely to be most effective when used together.