
Automating the Highly Difficult "Summary Performance Evaluation" with LLM In-Context Learning

Large Language Models

3 main points
✔️ Text summarization can be judged along many evaluation axes, which makes it hard to evaluate
✔️ Conventionally, multidimensional evaluation along many axes has required a large dataset
✔️ The proposed method removes the need for large datasets by using the in-context learning of large language models

Multi-Dimensional Evaluation of Text Summarization with In-Context Learning
written by Sameer Jain, Vaishakh Keshava, Swarnashree Mysore Sathyendra, Patrick Fernandes, Pengfei Liu, Graham Neubig, Chunting Zhou
(Submitted on 1 Jun 2023)
Comments: ACL Findings '23

Subjects: Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.


How do you usually evaluate a text?

For example, the key to clear writing is often said to be the 3Cs: Correct, Clear, and Concise. But what about literature? The classical Tale of Genji tends to be long-winded and leaves much unstated, so it hardly satisfies the 3Cs, yet its complex human relationships and skillful psychological description have made it a work that is highly regarded worldwide.

Writing, in other words, can be judged along many different axes. Large language models let us generate a large amount of text in a matter of seconds, but how should that text be evaluated? For natural language generation to develop further, text must be evaluated correctly along multidimensional axes that meet the needs of society.

The paper described here simplifies this kind of multidimensional evaluation by using the in-context learning of large language models. In-context learning is the ability of a large language model to learn how to answer subsequent questions from only a few examples added to its input.

Using summarization as an example, the paper tests the hypothesis that in-context learning can evaluate text along a desired axis without the large dataset that was previously necessary. Below we explain the problem setup, the structure and advantages of the proposed method, and the verification results.


Problem setup

In this article, we refer to a method that generates text as Natural Language Generation (NLG). This section describes the problem setup for evaluating text generated by an NLG.

Let x be the input sequence to the NLG and y be its output sequence. For text summarization, x is the original text and y is its summary.

Some evaluation frameworks output a score s representing the quality of y. The score calculation may or may not use human-generated references r.

In the multidimensional evaluation covered in the paper, y is evaluated with d quality indicators. When there is a single quality indicator, s is a scalar; when there are several, the score becomes a d-dimensional vector S = (s1, s2, ..., sd). Hence the name multidimensional evaluation. The name sounds grand, but it simply means evaluating a text with multiple indicators.

The paper evaluates y on four dimensions: Consistency, Relevance, Fluency, and Coherence. Coherence is the quality of the structure across multiple sentences.

The task, then, is to produce this four-dimensional evaluation automatically for summaries generated by an NLG.
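As a minimal sketch of the notation above, the d = 4 score vector S can be represented as a small record. The class name, field order, and numeric values are illustrative assumptions, not something defined in the paper.

```python
from dataclasses import dataclass

# The four dimensions named in the article; the ordering is an assumption.
DIMENSIONS = ("consistency", "relevance", "fluency", "coherence")


@dataclass(frozen=True)
class SummaryScore:
    """Hypothetical container for the score vector S = (s1, ..., sd), d = 4."""
    consistency: float
    relevance: float
    fluency: float
    coherence: float

    def as_vector(self):
        # Return the scores as a plain tuple in the order given by DIMENSIONS.
        return tuple(getattr(self, name) for name in DIMENSIONS)


# Made-up values, used only to show the shape of the data.
score = SummaryScore(consistency=4.0, relevance=3.5, fluency=5.0, coherence=4.5)
print(score.as_vector())
```

With a single indicator this would collapse to one float, which is the scalar case s described above.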

Structure of the proposed method

To obtain automatic evaluation results for summaries on the four dimensions of Consistency, Relevance, Fluency, and Coherence, the paper proposes the In-Context learning-based Evaluator (ICE). ICE evaluates a summary by giving a large language model the prompt (input text) shown in Figure 1.

Figure 1. The proposed method, ICE (In-Context learning-based Evaluator)

The blue part of the figure shows one example: Text is the original text, Summary is its summary, and Consistency is the score for the Consistency dimension (given as a number). Text and Summary are the input to the evaluator and Consistency is its output, so each example directly teaches the appropriate output for an input. One such set of Text, Summary, and Consistency can be regarded as a single piece of training data (called an in-context example in the paper). In the figure, two in-context examples are given, and the model is then asked to score a further text and summary pair on the same evaluation axis. The paper calls this final example, which comes at the end of the prompt and whose score is to be filled in, the test example.
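The prompt layout described above can be sketched as follows. This is a minimal illustration that assumes each example is written as three tagged lines and that examples are separated by a blank line; the exact formatting in the paper's figure may differ, and the function names and sample texts are hypothetical.

```python
def format_example(text, summary, dimension, score=None):
    """Render one example; the score is left blank for the test example."""
    score_part = "" if score is None else f" {score}"
    return f"Text: {text}\nSummary: {summary}\n{dimension}:{score_part}"


def build_ice_prompt(in_context_examples, test_text, test_summary, dimension):
    """Join scored in-context examples, then append the unscored test example."""
    blocks = [
        format_example(ex["text"], ex["summary"], dimension, ex["score"])
        for ex in in_context_examples
    ]
    blocks.append(format_example(test_text, test_summary, dimension))
    return "\n\n".join(blocks)


# Placeholder examples, invented purely for illustration.
examples = [
    {"text": "The council approved the budget on Monday.",
     "summary": "The budget was approved.", "score": 5},
    {"text": "Rain is expected across the region this weekend.",
     "summary": "A heatwave is coming.", "score": 1},
]
prompt = build_ice_prompt(
    examples,
    test_text="The museum reopens next month after repairs.",
    test_summary="The museum will reopen next month.",
    dimension="Consistency",
)
print(prompt)
```

The prompt ends with the bare tag `Consistency:`, so the model's continuation is read as the score for the test example.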

Technical point 1. Use in-context learning of large language models

Large language models have a capability called in-context learning: when a few examples (in-context examples) are added to the model's input, it answers subsequent questions in line with those examples. The paper takes advantage of in-context learning in one such model, GPT-3 text-davinci-003.

Technical point 2. Create in-context examples for each evaluation dimension, using words with condensed meanings as tags

As the problem setup states, the goal here is to assign scores from multiple perspectives, given an original text and its summary.

Technical point 1 was to use in-context learning for this purpose. What determines how well in-context learning performs is the content of the prompt.

The paper does not explain the design rationale behind this prompt, for example whether other prompts were tried and found to evaluate poorly. To understand it better, I tried to reason about what in-context learning does and whether this prompt suits it.

Suppose in-context learning works as follows: when text of the form input: xxx output: yyy input: zzz output: is given to a large language model, the model infers a rule about what output belongs with each input, and then fills in the words after the final output: by applying that rule to the unseen input. If the model could truly infer rules purely from what follows input and what follows output, then neutral tags such as input and output would suffice.

This prompt does not use such neutral tags, however, and I think that is the key point.

First, the proposed method uses the tags Text, Summary, and Consistency, which carry meaning and are easy for humans to understand, as in the example in the figure. This may help the model notice that the task concerns the consistency of the Summary with respect to the Text, and focus on the regularity in that relationship. With input and output as tags, the model would have to look for regularities between the whole input and the whole output, which seems a harder way to find the rule.

Second, the proposed method tags the score with the word Consistency rather than simply output. A tag such as "evaluation indicator 1" would probably make it more likely that a number is filled in, but it gives no more hint than output: about which value to choose. Conversely, one could give more information by including a definition of consistency, but the large language model might then handle the prompt poorly because of the extra information. In that sense, building in-context examples around tags that condense the meaning into a single word may be what makes this work.

Third, the proposed method does not ask the model to score several dimensions at once. Instead, it builds in-context examples and a test example for each evaluation axis and runs each evaluation separately. Asking a large language model to score several dimensions at once is like increasing the number of blanks it must fill in, so the response is likely to become harder to control; scoring one dimension at a time should give more stable output. In addition, if the dimensions are meant to be evaluated independently in the first place, scoring one dimension at a time is a reasonable choice, since the score for one dimension influencing the score for another would only add noise.
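The one-dimension-at-a-time procedure can be sketched as a loop that builds a separate prompt per dimension. This is only an illustration under stated assumptions: `complete` stands for any function that sends a prompt to a language model and returns its text continuation (here replaced by a stub, with no real API call), and the prompt layout, number parsing, and all names are hypothetical.

```python
import re

DIMENSIONS = ("Consistency", "Relevance", "Fluency", "Coherence")


def build_prompt(examples, text, summary, dimension):
    # Scored in-context examples first, then the unscored test example.
    parts = [
        f"Text: {ex['text']}\nSummary: {ex['summary']}\n{dimension}: {ex['score']}"
        for ex in examples
    ]
    parts.append(f"Text: {text}\nSummary: {summary}\n{dimension}:")
    return "\n\n".join(parts)


def parse_score(completion):
    # Take the first number that appears in the model's continuation.
    match = re.search(r"-?\d+(?:\.\d+)?", completion)
    if match is None:
        raise ValueError(f"no number found in completion: {completion!r}")
    return float(match.group())


def evaluate_summary(text, summary, examples_by_dimension, complete):
    # One independent prompt and one model call per dimension.
    scores = {}
    for dimension in DIMENSIONS:
        prompt = build_prompt(examples_by_dimension[dimension], text, summary, dimension)
        scores[dimension] = parse_score(complete(prompt))
    return scores


def fake_complete(prompt):
    # Stub standing in for a real model; it always answers " 4".
    return " 4"


examples = {
    d: [{"text": "Source A.", "summary": "Summary A.", "score": 5}]
    for d in DIMENSIONS
}
result = evaluate_summary("Source B.", "Summary B.", examples, fake_complete)
print(result)
```

Because each dimension has its own prompt, a new dimension only requires adding its name and its scored examples; nothing about the other dimensions changes.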

Technical point 3. Select a few in-context examples according to a distribution

We have explained that the strength of in-context learning is that it can learn from only a few examples, but the flip side is that it can only be given a few examples. The paper uses GPT-3, and the amount of text that can be entered into GPT-3 at one time (the context window size) is limited, so only four in-context examples are used. To narrow the in-context examples down to that number, the paper samples four from a pool of in-context examples prepared in advance.

The sampling methods proposed are uniform sampling and stratified sampling.

Uniform sampling, in which every example is drawn with equal probability, is intended to reproduce the distribution of examples in the pool. That said, how the pool itself is built would seem to affect performance, but the paper gives no particular explanation of how the examples in the pool are created.

Stratified sampling divides the examples in the pool into four groups by score (high, upper-middle, lower-middle, and low) and samples one from each group. The intent is to pick representative points from each score range. At a minimum this seems reasonable, since the model needs examples with different scores in order to understand what distinguishes them. I wondered whether four groups would be enough, and the appendix of the paper indicates that the ground-truth human scores are given on a five-point scale. Ideally, then, there would be five groups.

The verification results in the paper include a comparison between the distribution of human scores and the distribution of scores from the proposed method. The scores output by the proposed method were broadly close to the distribution of human scores, although they tended to concentrate in a certain range of score values. The paper basically recommends uniform sampling, because it is closer to the distribution of human scores and more stable than stratified sampling.
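The two sampling strategies can be sketched as below. This is an illustration under assumptions, not the paper's implementation: the pool is a list of dictionaries with a `score` field, and the stratified version forms its four groups by sorting the pool by score and cutting it into four nearly equal slices, which is only one possible reading of "four groups of scores".

```python
import random


def uniform_sample(pool, k=4, rng=random):
    """Draw k distinct examples, each with equal probability."""
    return rng.sample(pool, k)


def stratified_sample(pool, n_groups=4, rng=random):
    """Sort by score, split into n_groups slices, and draw one from each.

    Assumes len(pool) >= n_groups so that no slice is empty.
    """
    ranked = sorted(pool, key=lambda ex: ex["score"])
    size, extra = divmod(len(ranked), n_groups)
    picks, start = [], 0
    for group in range(n_groups):
        # The first `extra` groups absorb one leftover example each.
        end = start + size + (1 if group < extra else 0)
        picks.append(rng.choice(ranked[start:end]))
        start = end
    return picks


# Made-up pool of 12 examples with scores from 1 to 5.
pool = [{"id": i, "score": 1 + i % 5} for i in range(12)]
rng = random.Random(0)  # fixed seed so repeated runs pick the same examples
uniform_picks = uniform_sample(pool, rng=rng)
stratified_picks = stratified_sample(pool, rng=rng)
print([ex["score"] for ex in uniform_picks])
print([ex["score"] for ex in stratified_picks])
```

Uniform sampling tends to mirror whatever score distribution the pool has, while the stratified version always returns one low, two middle, and one high example, whatever the pool looks like.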

Advantages of the proposed method

The proposed method, ICE, has two advantages: it requires no training, and it is extensible.

No training required

Strictly speaking, in-context learning is also a kind of learning, so the label is a little loose, but the proposed method requires neither supervised fine-tuning nor a large dataset. All it needs is a small number of in-context examples supplied to the large language model at inference time.

Extensibility

To evaluate a new dimension, it is enough to provide in-context examples scored on that dimension, so the number of evaluation axes can be increased easily.

Verification Results

Table 1 compares how each evaluation method, including the proposed one, rates the summarization ability of several NLGs.

Table 1. Summary evaluation results for each NLG (Model) under each evaluation method (Metric)

The evaluation methods compared are Human (evaluation by people), ROUGE-L and BARTSc. (existing automatic methods), and ICE (the proposed method).

ROUGE-L compares the summary under evaluation with a human-written reference summary and measures the length of the longest common subsequence of words, that is, the longest sequence of words that appears in both texts in the same order. The longer this shared sequence, the higher the judged quality.

BARTSc. is BARTScore, a method that treats text evaluation as a text generation task. BART is a model that assigns a probability to each next word given the preceding text, and the probability computed by the model is used as the score. For example, a summary receives a high quality score when the model assigns a high probability to generating that summary from the original text.
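To make the idea concrete, here is a simplified sketch of a ROUGE-L style score. It assumes lowercased whitespace tokenization and an unweighted F1 of LCS-based precision and recall; the official ROUGE toolkit differs in details such as tokenization, stemming, and the weighting between precision and recall. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(lcs_length(candidate.split(), reference.split()))  # 5 shared words in order
print(rouge_l_f1(candidate, reference))
```

Here the shared subsequence is "the cat on the mat", so five of the six words match in order. A summary that conveys the same content in different words would share far fewer, which is why a score of this kind depends so heavily on the reference.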
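Written as a formula in this article's notation, with x the original text and y the summary of length m, the score described above is, as far as I understand BARTScore, a weighted sum of token log-probabilities. The symbols \theta (the BART parameters) and \omega_t (per-token weights, usually taken to be uniform) do not appear in this article and are introduced here only for illustration.

```latex
\mathrm{BARTScore}(x \rightarrow y)
  = \sum_{t=1}^{m} \omega_t \,\log p\!\left(y_t \mid y_{<t},\, x,\, \theta\right)
```

A summary whose tokens the model finds likely given the original text gets a log-probability closer to zero, and therefore a higher score.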

The three NLGs evaluated by these methods are GPT-3, BRIO, and T0. GPT-3 is a large language model developed by OpenAI. BRIO is an NLG that is deliberately trained to generate a variety of candidate summaries for a single text and to select the best one as its output. T0 is a smaller model than GPT-3 but is said to have comparable capabilities.

The results of the evaluation show that the NLGs in order of best score in the human evaluation are GPT-3, BRIO, and T0. The rows with red, colorless, and blue background colors in Table 1 indicate the top, middle, and bottom scores, respectively. In other words, the comparison methods with this same color sequence are consistent with the human rating order.

Among the evaluation results, only ICE, the proposed method, has the same color sequence as the human evaluation. In other words, the evaluation accuracy of ICE is shown to be high.
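The "same color sequence" check amounts to comparing the order in which each metric ranks the systems with the order from human evaluation. The sketch below shows that comparison on deliberately made-up numbers: the system and metric names are placeholders, and none of the values come from Table 1.

```python
def ranking(scores):
    """Return system names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy values invented for illustration only; NOT the paper's results.
toy_scores = {
    "human":    {"system_a": 4.5, "system_b": 4.0, "system_c": 3.0},
    "metric_x": {"system_a": 0.9, "system_b": 0.7, "system_c": 0.4},
    "metric_y": {"system_a": 0.2, "system_b": 0.5, "system_c": 0.3},
}

human_order = ranking(toy_scores["human"])
agreement = {
    name: ranking(scores) == human_order
    for name, scores in toy_scores.items()
    if name != "human"
}
print(human_order)
print(agreement)
```

In this toy data, metric_x orders the systems as the human scores do and metric_y does not. That is the sense in which only ICE is said to match the human evaluation in Table 1.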

ROUGE-L and BARTSc., on the other hand, rank GPT-3 third, which disagrees with the human evaluation.

The paper speculates that this is because most existing methods score a summary against human-written reference summaries, so a summary that does not resemble the reference cannot score highly. In other words, these methods may fail to recognize the quality of a summary that people prefer when it does not resemble the reference text.

The proposed method, by contrast, bases its evaluation on a large language model that has not been trained on human-written reference summaries. This is thought to let it evaluate in a human-like way without being biased toward the reference summaries.

In closing

In this article, we described a method for automatically evaluating machine-generated summaries from multiple perspectives. The idea is to evaluate summaries automatically through what is known as prompt engineering, using the in-context learning of a large language model.

Conceptually, the technique falls under the existing prompt engineering approach called few-shot prompting, but it struck me as a very simple and practical idea.

Reading only the method description, the prompt may look like a straightforward, obvious choice for the task, but I imagine that a prompt this simple was arrived at only after careful consideration.

More papers will probably appear that try something out because it looks achievable with a large language model. Since their findings are easy to try in practice, they should be helpful in real work.
