
Challenges And Solutions For German Summarization Systems: Analysis Of Training Data And Existing Systems

Computation And Language

3 main points
✔️ In natural language processing, solutions are becoming important not only for English but also for other languages.
✔️ Examines the need for abstractive text summarization in German and explores why practical solutions are lacking in industry.

✔️ An examination of the state of abstractive summarization in German shows some positive signs, including an increase in the number of publicly available systems and an expansion of datasets.

On the State of German (Abstractive) Text Summarization
written by Dennis Aumiller, Jing Fan, Michael Gertz
(Submitted on 17 Jan 2023)
Comments: Accepted at the 20th Conference on Database Systems for Business, Technology and Web (BTW'23)

Subjects: Computation and Language (cs.CL)



The images used in this article are from the paper, the introductory slides, or were created based on them.


In the field of natural language processing, solutions that cover not only English but also other languages are becoming increasingly important. A useful tool for companies that process large amounts of text data is a summarization system that condenses long documents to make them easier to digest.

This study examines the need for abstractive text summarization in German and explores why practical solutions are lacking in industry. The primary focus is on the training data and an analysis of existing summarization systems. The study shows that popular datasets are often unsuitable for abstractive summarization, and that the available systems frequently fail to outperform simple baselines. Low evaluation quality stems from factors such as insufficient training data, positional bias, and a lack of preprocessing strategies and analysis tools.

Therefore, it is important to evaluate models on a clean dataset, which is expected to improve the quality of the evaluation. It is also noted that relying solely on n-gram-based scoring methods, one of the most common approaches to assessing summary quality, can be problematic.


There are two primary approaches to summarization: extractive systems and abstractive systems.

Abstractive summaries have the potential to improve fluency and conciseness by introducing new words and sentences. However, non-English summarization suffers from a lack of data and evaluation metrics.

Current summarization systems suffer from extraction and syntax errors that can lead to erroneous conclusions. Furthermore, current pipelines do not account for user-specified filtering procedures (user-side input methods), and the quality of summarization datasets needs to be improved. The study analyzes German as a representative non-English language.

Related Research

Below is a list of datasets used in previous studies.

MLSUM: A multilingual dataset consisting of news articles and their summaries
MassiveSumm: A dataset focused on automatically extracted summaries with a structure similar to MLSUM
Swisstext: A dataset providing long-form summaries based on Wikipedia pages
Klexikon: A dataset with articles extracted from Klexikon instead of Wikipedia
WikiLingua: A dataset providing summaries of procedures extracted from WikiHow
LegalSum: A dataset providing summaries of legal documents
EUR-Lex-Sum: A dataset providing summaries of EU legal documents

Publicly available models on the Hugging Face Hub, as well as private models published for the Swisstext 2019 Summarization Challenge, are candidates in the evaluation of German summarization systems. These models are evaluated with ROUGE metrics, using ROUGE-1, ROUGE-2, and ROUGE-L scores as performance indicators. ROUGE is a set of metrics that quantitatively measures the quality and adequacy of a summary based on its n-gram overlap with a reference. In addition, extractive summarization services provided by cloud providers play a certain role. This reveals the performance and limitations of different approaches and models, and points the way toward developing and improving effective summarization systems.
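As a rough illustration (not the paper's implementation), a recall-oriented ROUGE-N score over whitespace tokens can be sketched in a few lines; production implementations additionally compute precision and F-measure and may apply stemming:

```python
from collections import Counter

def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    """Recall-oriented ROUGE-N: fraction of reference n-grams found in the candidate."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    # Clipped overlap: each reference n-gram counts at most as often as it occurs
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())
```

With `n=1` this corresponds to ROUGE-1 recall, with `n=2` to ROUGE-2 recall; ROUGE-L is based on the longest common subsequence instead of fixed-size n-grams.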

System Evaluation Methodology

Data evaluation

The first step is to clean the data. This includes basic techniques to ensure data quality: removing empty samples, enforcing a minimum text length, filtering by compression rate, and removing duplicates. Sample inspection methods are also presented, including sequential review of samples, review of random samples, and inspection of outliers and representative samples. These methods help to assess the quality of the dataset and ensure reliable generalizations from the experimental results.
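The cleaning steps above can be sketched as a simple filter over (document, summary) pairs; the thresholds here are illustrative placeholders, not the values used in the paper:

```python
import hashlib

def clean_dataset(samples, min_len=20, min_compression=1.2):
    """Filter (document, summary) pairs with basic quality heuristics."""
    seen = set()
    cleaned = []
    for doc, summary in samples:
        # Drop empty samples
        if not doc.strip() or not summary.strip():
            continue
        # Enforce a minimum document length (in whitespace tokens)
        if len(doc.split()) < min_len:
            continue
        # Compression-rate filter: the document should be longer than its summary
        if len(doc.split()) / max(len(summary.split()), 1) < min_compression:
            continue
        # Remove exact duplicates via a content hash
        key = hashlib.sha256((doc + "\x1f" + summary).encode()).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((doc, summary))
    return cleaned
```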

Model Evaluation

Several model-checking methods have been proposed to evaluate the performance of summarization systems. First, a cleaned test set is used to evaluate the model. This is a standard technique to verify that the trained model has not overfit. Next, the model is tested on modified test data to investigate its generalization ability.

This method can serve as a means to examine whether a particular system is applicable to other data sets. It is also proposed to assess the quality of the summaries using specific measures applied to the system summaries. Factors such as abstractness and lexical variation in the summary are taken into account to provide a preliminary assessment of the quality of the output.
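One common abstractness measure of this kind is the novel n-gram ratio, i.e. the share of summary n-grams that never occur in the source. A minimal sketch, assuming whitespace tokenization:

```python
def novel_ngram_ratio(source: str, summary: str, n: int = 2) -> float:
    """Share of summary n-grams that do not appear in the source text.

    A rough abstractiveness proxy: 0.0 means fully extractive, 1.0 fully novel.
    """
    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    summary_grams = ngrams(summary)
    if not summary_grams:
        return 0.0
    novel = summary_grams - ngrams(source)
    return len(novel) / len(summary_grams)
```

A purely extractive system scores near 0.0 on this measure, which is one quick way to detect models that only copy from the input.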

Finally, the consistency of facts to be maintained within the summary is discussed. Summaries should maintain the facts of the original reference text. This may be implemented as an optimization target to evaluate the truthfulness of the summary and generate a more truthful summary.

Extraction model and baseline system

An extraction model is a technique for generating text summaries that extracts important sentences or phrases from the original text and combines them to generate a summary. The importance of a sentence or phrase is usually determined based on factors such as the frequency of words in the sentence, the position of the sentence, and the length of the sentence. Because the extraction model uses information from the original text as it is, the content of the summary is fully contained in the original document.
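A toy word-frequency extractive summarizer along these lines might look as follows; real systems use far stronger sentence-importance signals, but the structure is the same:

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 2) -> str:
    """Toy frequency-based extractive summarizer.

    Scores each sentence by the summed document frequency of its words,
    then returns the top-scoring sentences in their original order.
    """
    # Naive sentence split on punctuation followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"\w+", sentences[i].lower())),
        reverse=True,
    )
    chosen = sorted(scored[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)
```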

A baseline system refers to a reference model or algorithm for a given task or problem. Typically, a baseline represents the simplest or most basic existing method for that task. A baseline may be used as a basis for subsequent refinement or for evaluating new methods. In text summarization, a baseline system is a simple method such as "lead-3," which uses the first three sentences of the document as the summary.
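A minimal sketch of such a lead baseline, which simply returns the first n sentences (using a naive regex-based sentence splitter):

```python
import re

def lead_n(document: str, n: int = 3) -> str:
    """Lead-N baseline: return the first n sentences of the document as the summary."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return " ".join(sentences[:n])
```

Despite its simplicity, a lead baseline is often hard to beat on news data, where the most important information tends to appear at the start of the article.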



MLSUM and MassiveSumm are datasets used for training for summary generation. These datasets contain information specific to the summarization task. However, these datasets may contain low quality or inappropriate examples. Therefore, it is common practice to filter out these examples from the training dataset.

Filtering may change the distribution of the training data set. This means that the nature and characteristics of the entire data set may change. Statistical measures such as means and quartiles are used to visualize this distribution shift. These indicators help to summarize the characteristics of the entire data set and indicate changes.

Therefore, by looking at the changes in mean and quartiles (shown as black dashed lines in the paper's figures), one can understand the distribution shift in the dataset caused by filtering. Such an analysis helps in evaluating the quality of the training dataset and in selecting a suitable dataset for training the model.
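Such before/after statistics can be computed with the standard library alone; a minimal sketch over token lengths:

```python
from statistics import mean, quantiles

def length_stats(texts):
    """Mean and quartiles of token lengths, for comparing a dataset before/after filtering."""
    lengths = [len(t.split()) for t in texts]
    q1, q2, q3 = quantiles(lengths, n=4)
    return {"mean": mean(lengths), "q1": q1, "median": q2, "q3": q3}
```

Running this on a dataset before and after filtering and comparing the two dictionaries makes the distribution shift explicit.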

Results and Baseline Runs

EVA is a platform for objectively evaluating and comparing model performance, but it has been pointed out that the evaluation results of published models are difficult to reproduce. Furthermore, there is a significant gap between self-reported scores on the test set and actual scores, highlighting the issue that the expected scores can only be reproduced under idiosyncratic settings.

The reproducibility of ROUGE assessment metrics is also a matter of debate. In particular, there is a need to clarify the context of the evaluation based on ROUGE scores using different baseline approaches. Efforts to improve the reproducibility of model evaluations with EVA are needed to address these issues.

Result after filtering

For the MLSUM and MassiveSumm test sets, ROUGE-1 scores can drop below 20 after filtering. In particular, on the MLSUM dataset, the t5-based model trained on the filtered data performed better than before filtering. MassiveSumm, on the other hand, has a significantly different length distribution and is strongly affected by the extractive filter. These findings prompt discussion of how filtering affects the current state of the art and suggest that combining different evaluation methods may enable a more complete evaluation.

Qualitative analysis

We found that no publicly available system has been examined beyond simply computing ROUGE scores. In some systems, fatal failures can be observed despite the high scores reported. We also found that all of the architectures used work only within a relatively limited context and cannot handle long-form summaries. These insights suggest that the practical suitability of the models has not been demonstrated. Further investigation of the systems' output showed that summaries can deviate significantly from the original, lack accuracy and faithfulness of content, and rarely provide consistent sentences.


Examining the state of German abstractive summarization, there are some positive signs, such as an increase in the number of publicly available systems and an expansion of datasets. However, many challenges remain, most notably data quality and the ability of models to generalize. To address these challenges, an exploratory data-centric approach and ethical considerations are important. In addition, the development of a non-independent training framework and the design of systems that can be applied across multiple domains are needed. Looking toward the future, community collaboration and effort are critical. It is hoped that this will result in more sophisticated abstractive summarization systems and expand the scope of their applications.


If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
