What Is The Importance Of Pre-training On Data With Expertise? ~ Application Of BERT To The Classification Of Legal Documents ~.
3 main points
✔️ Apply BERT to the classification task of legal documents.
✔️ Compare the accuracy of Fine Tuning a model pre-trained with text containing legal expertise and a model pre-trained with generic text.
✔️ Also consider methods for applying BERT to longer legal documents that exceed the BERT limit of 512 tokens.
Effectively Leveraging BERT for Legal Document Classification
written by Nut Limsopatham
(Submitted on Nov 2021)
Comments: EMNLP | NLLP
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Recent developments in deep learning have contributed to improving the accuracy of various tasks in natural language processing (NLP), such as document classification, automatic translation, dialogue systems, etc. Real-world applications of NLP are very advanced, and there are many possible applications of NLP in the legal field, the topic of this paper.
A model that has received particular attention in the NLP community in recent years is BERT, a model published by Google in 2018; BERT acquires knowledge about a target language by pre-training with a large unlabeled corpus.
By fine-tuning the pre-trained models created in this way with the dataset of the task being tackled, it is possible to build models with higher performance at a lower cost than previously possible.
On the other hand, BERT also has the following challenges:
- It cannot effectively process text longer than 512 tokens
- Pre-training is expensive because it requires processing large text datasets
Therefore, through the task of classifying legal documents with BERT, this paper discusses the following two questions:
- How to process long texts such as legal documents with BERT
- How important pre-training on a domain-specific corpus is for tasks that require expertise, such as law
Specifically, various BERT-based models were trained and evaluated on two tasks, predicting law violations using the ECHR Violation Dataset and predicting overruling using the Overruling Task Dataset, and the results were compared and discussed.
Prerequisite knowledge
Application of Natural Language Processing Technology in the Legal Field
As already mentioned, there are various possible applications of NLP in the legal field, examples of which include
- Predicting Legal Violations
- Predicting Judgment
- Extraction of legal information
- Generation of the Court's Opinion
Datasets have been developed for these applications, and the ECHR Violation Dataset and Overruling Task Dataset used in this study are examples. Legal documents used as datasets have the following characteristics:
- Written in long sentences
- Unstructured
- Contain technical terms
And it is the classification of legal documents by BERT that we focus on in this study.
About BERT
In this study, BERT is used to classify legal documents. BERT is a multi-layer bidirectional Transformer encoder that acquires knowledge of the language through pre-training on the following two tasks:
- Masked language model prediction, which predicts masked words in a sentence from the surrounding words
- Next sentence prediction, which, given two sentences, predicts whether the second sentence follows the first
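To make the masked language model objective concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization and always substitutes the `[MASK]` token; the function name `mask_tokens`, the masking rate, and the seed are illustrative assumptions rather than code from the paper. BERT's actual procedure also sometimes keeps the selected token or replaces it with a random one, which is omitted here.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a masked-language-model example.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model is only scored on the positions it has to reconstruct.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels

# Toy sentence (made up for illustration).
sentence = "the court held that the rights of the applicant had been violated".split()
masked, labels = mask_tokens(sentence, mask_prob=0.3, seed=1)
```

During pre-training, the model sees `masked` as input and is trained to predict the non-`None` entries of `labels`.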
A common way to use BERT is to take a model that has been pre-trained on a large dataset, for example one distributed through Hugging Face, and fine-tune it on your own task. This pre-trained BERT + fine-tuning approach achieved SoTA on various datasets such as GLUE and SQuAD at the time.
The model used for the legal document classification task in this study has the simplest structure, BERT + linear transformation layer for classification.
The final output (classification result) is obtained by applying a linear transformation to the distributed representation of the special [CLS] token.
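The classification head described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 4-dimensional [CLS] vector, the weights, and the bias are made-up toy values, and the per-label sigmoid with a 0.5 threshold is an assumed choice for the multi-label case that the article does not spell out.

```python
import math

def linear(cls_vector, weights, bias):
    """Apply one linear layer: one logit per label from the [CLS] vector."""
    return [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(cls_vector, weights, bias, threshold=0.5):
    """Multi-label prediction: each label is decided independently."""
    logits = linear(cls_vector, weights, bias)
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]

# Toy 4-dimensional [CLS] vector and a 3-label head (all values are made up).
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [
    [1.0, 0.0, 0.0, 0.0],  # label 0
    [0.0, 1.0, 0.0, 0.0],  # label 1
    [0.0, 0.0, 0.0, 1.0],  # label 2
]
bias = [0.0, 0.0, -1.0]
predicted = predict_labels(cls_vector, weights, bias)
```

In a real model the [CLS] vector comes from the BERT encoder and the weights are learned during fine-tuning; for a binary task such as overruling prediction, a single output or a two-way softmax would play the same role.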
Application of BERT in the field of law
BERT has already been used in various studies for law-related tasks.
Zheng et al. found that BERT models pre-trained with legal documents perform better than BERT pre-trained with generic text.
Chalkidis et al. reported that plain BERT did not achieve good accuracy when predicting law violations on a dataset of texts longer than 512 tokens, and addressed the problem by using Hierarchical BERT.
BERT for long texts
The plain BERT model can process at most 512 tokens per input. However, there are already several BERT-based methods that can process longer texts.
Beltagy et al. and Zaheer et al. addressed this by changing the attention mechanism.
Pappagari et al. handled long texts by splitting them into segments below a certain length, feeding each segment to BERT, and aggregating the resulting distributed representations, for example with max pooling or mean pooling. This study applied these methods to the classification of legal documents and tested their performance.
Classification of legal documents by BERT (experimental setting)
This chapter explains the experimental setting of the legal document classification tasks addressed in the study.
Datasets used
ECHR Violation Dataset (Multi-Label)
The task for this dataset is to predict which articles of the European Convention for the Protection of Human Rights and Fundamental Freedoms are violated in a given case. The number of labels is 40, and an overview of the dataset is given below.
The evaluation was performed by calculating the micro F1 score on the test data.
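For readers unfamiliar with micro averaging, the following sketch shows how a micro F1 score can be computed for multi-label predictions: true positives, false positives, and false negatives are pooled over every (document, label) pair before the F1 formula is applied. The function name `micro_f1` and the toy label vectors are illustrative assumptions, not data from the paper.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (document, label) decisions.

    y_true and y_pred are lists of equal-length 0/1 label vectors.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Two toy documents with three possible labels each (made-up values).
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1]]
score = micro_f1(y_true, y_pred)  # tp=2, fp=1, fn=1
```

Because counts are pooled before averaging, frequent labels influence the micro F1 score more than rare ones.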
Overruling Task Dataset
This dataset is used for the task of predicting whether a given legal text will or will not overturn a previous ruling. The task is a binary classification, and the dataset is outlined below.
Note that 10-fold cross-validation was performed for this task.
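As a reminder of what 10-fold cross-validation involves, here is a minimal sketch that splits sample indices into ten folds, each used once as the test set while the rest are used for training. The helper name `k_fold_indices` and the sample count of 25 are hypothetical, and the sketch assumes contiguous, unshuffled folds; the article does not say how the folds were actually drawn.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Toy example: 25 samples split into 10 folds.
folds = list(k_fold_indices(n_samples=25, k=10))
```

The model is trained and evaluated once per fold, and the per-fold F1 scores are then averaged to give the reported figure.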
Hyperparameters and optimization
The hyperparameters and optimization algorithm used in the experiments are as follows:
- Learning rate: 5e-5 with a linear learning-rate scheduler
- Optimization algorithm: AdamW
- Batch size: 16
- Number of epochs: 5
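The effect of a linear learning-rate scheduler with these settings can be sketched as follows. The sketch assumes linear decay from 5e-5 to zero with no warmup (the article does not mention warmup), and the dataset size of 1,000 examples is a hypothetical value chosen only to make the arithmetic concrete. AdamW itself is not implemented here.

```python
def linear_schedule(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step`: optional linear warmup, then linear decay to 0."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

def total_training_steps(n_examples, batch_size=16, epochs=5):
    """Number of optimizer steps, counting a final partial batch in each epoch."""
    steps_per_epoch = -(-n_examples // batch_size)  # ceiling division
    return steps_per_epoch * epochs

# Hypothetical dataset of 1,000 training examples.
steps = total_training_steps(n_examples=1000)  # 63 steps per epoch * 5 epochs
lrs = [linear_schedule(s, steps) for s in range(steps + 1)]
```

With batch size 16 and 5 epochs, the learning rate starts at 5e-5 and shrinks by the same amount at every step until it reaches zero at the end of training.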
Model used
Next, the models used. The study uses the following four BERT-based pre-trained models.
- BERT: BERT ("bert-base-uncased" on Hugging Face) pre-trained on generic texts such as BookCorpus and English Wikipedia.
- ECHR-Legal-BERT: BERT (same structure as "bert-base-uncased") pre-trained on legal documents including the ECHR Dataset.
- Harvard-Law-BERT: BERT (same structure as "bert-base-uncased") pre-trained on the Harvard Law case corpus, a collection of legal documents.
- RoBERTa: RoBERTa ("roberta-base" on Hugging Face) pre-trained on generic texts such as BookCorpus and CommonCrawl News.
In addition, the following methods are applied when feeding long texts into BERT.
- RR-* Model: Truncate texts longer than 512 tokens by keeping the front part of the text and removing the rear part.
- RF-* Model: Truncate texts longer than 512 tokens by keeping the rear part of the text and removing the front part.
- MeanPool-* Model: Split texts longer than 512 tokens into chunks of 200 tokens each. Each chunk is fed into BERT, and the average of the BERT outputs is used as the distributed representation.
- MaxPool-* Model: Split texts longer than 512 tokens into chunks of 200 tokens each. Each chunk is fed into BERT, and the element-wise maximum of the BERT outputs is used as the distributed representation.
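The four long-text strategies listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: chunks are assumed to be consecutive and non-overlapping, and `encode_chunk` is a hypothetical stand-in for BERT that returns a tiny 2-dimensional vector so the example runs without any model.

```python
MAX_LEN = 512    # BERT's input limit, in tokens
CHUNK_LEN = 200  # chunk size used by the pooling variants

def keep_front(tokens, max_len=MAX_LEN):
    """RR-*: keep the first max_len tokens and drop the rear."""
    return tokens[:max_len]

def keep_rear(tokens, max_len=MAX_LEN):
    """RF-*: keep the last max_len tokens and drop the front."""
    return tokens[-max_len:]

def split_into_chunks(tokens, chunk_len=CHUNK_LEN):
    """Split a token list into consecutive chunks of at most chunk_len tokens."""
    return [tokens[i:i + chunk_len] for i in range(0, len(tokens), chunk_len)]

def mean_pool(chunk_vectors):
    """MeanPool-*: element-wise average of the per-chunk vectors."""
    n = len(chunk_vectors)
    return [sum(column) / n for column in zip(*chunk_vectors)]

def max_pool(chunk_vectors):
    """MaxPool-*: element-wise maximum of the per-chunk vectors."""
    return [max(column) for column in zip(*chunk_vectors)]

def encode_chunk(chunk):
    """Hypothetical stand-in for BERT: maps a chunk to a 2-dimensional vector."""
    return [float(len(chunk)), float(sum(chunk) % 7)]

tokens = list(range(650))  # toy token ids for a 650-token document
chunks = split_into_chunks(tokens)  # chunk lengths: 200, 200, 200, 50
chunk_vectors = [encode_chunk(c) for c in chunks]
document_mean = mean_pool(chunk_vectors)
document_max = max_pool(chunk_vectors)
```

The truncation variants discard everything outside one 512-token window, whereas the pooling variants let every part of the document contribute to the final representation that is passed to the classification layer.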
Furthermore, for comparison with "general BERT and RoBERTa + methods for long texts", models that can process inputs longer than 512 tokens by using a different attention mechanism were also trained and evaluated.
- BigBird: A BERT-based model that can process more than 512 tokens by combining several kinds of attention, such as random attention, global attention, and window attention. Pre-trained on generic texts such as CommonCrawl News.
- LongFormer: A BERT-based model that can process more than 512 tokens by combining several kinds of attention, such as sliding window attention, dilated sliding window attention, and global attention. Pre-trained on generic texts such as BookCorpus and English Wikipedia.
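To illustrate why these attention patterns scale to long inputs, the sketch below builds a boolean attention mask that combines sliding window attention with global attention and counts how many token pairs are actually attended. It is a simplified illustration: the function names, the window size, and the choice of position 0 as the only global token are assumptions, and BigBird's random attention and LongFormer's dilation are left out.

```python
def sparse_attention_mask(seq_len, window=2, global_positions=(0,)):
    """Build a mask where mask[i][j] is True if token i may attend to token j.

    Combines sliding window attention (each token sees neighbours within
    `window` positions) with global attention (chosen tokens see, and are
    seen by, every position).
    """
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_window = abs(i - j) <= window
            is_global = i in global_set or j in global_set
            row.append(in_window or is_global)
        mask.append(row)
    return mask

def count_attended_pairs(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)

mask = sparse_attention_mask(seq_len=16, window=2, global_positions=(0,))
sparse_pairs = count_attended_pairs(mask)
full_pairs = 16 * 16  # what full self-attention would compute
```

For a fixed window size, the number of attended pairs grows roughly linearly with sequence length instead of quadratically, which is what makes inputs well beyond 512 tokens feasible.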
The pre-trained models above, together with the methods for handling texts longer than 512 tokens, were applied to legal documents and their accuracy was compared. Now it's time to see the results.
Classification of legal documents by BERT (experimental results)
ECHR Violation Dataset
The results of the learning and evaluation against the ECHR Violation Dataset are as follows
We will compare these results from the following three perspectives.
- Comparison between general BERT and RoBERTa models
- Comparison between methods for applying the general BERT and RoBERTa models to long sentences.
- Comparison of common BERT and RoBERTa models with BigBird and LongFormer.
First, the "comparison between general BERT and RoBERTa models".
Among the four models, BERT, ECHR-Legal-BERT, Harvard-Law-BERT, and RoBERTa, the highest F1 score was recorded by ECHR-Legal-BERT, the BERT pre-trained on legal documents including the ECHR Dataset.
From this, we can say that BERT pre-trained on sentences that are highly relevant to the legal document classification task tends to produce higher accuracy than BERT/RoBERTa pre-trained on generic texts.
On the other hand, there are also results where RoBERTa pre-trained on generic texts achieves higher accuracy than Harvard-Law-BERT. This means that although pre-training on legal texts has an effect, it may not be large enough to outweigh the improvement in accuracy that comes from the model itself.
Next, we discuss "Comparison among methods for applying the general BERT and RoBERTa models to long sentences".
We compared four methods, RR-* Model, RF-* Model, MeanPool- *Model, and MaxPool-*Model, and MaxPool-*Model recorded the highest F1 score.
Finally, the "comparison of general BERT and RoBERTa models with BigBird and LongFormer". The results show that BigBird and LongFormer record much higher F1 scores than "general BERT and RoBERTa models + methods for applying them to long texts". This again confirms that BigBird and LongFormer are very effective methods for processing long documents.
These are the experimental results for the ECHR Violation Dataset.
Overruling Task Dataset
Let us look at the training and evaluation results for the Overruling Task Dataset. Unlike the ECHR Violation Dataset, the Overruling Task Dataset does not contain texts longer than 512 tokens. Therefore, the methods described above for applying BERT to long texts are not used. The results are as follows.
The table shows the average F1 score calculated for each model by 10-fold cross-validation. The results show that Harvard-Law-BERT and ECHR-Legal-BERT, the models pre-trained on law-related texts, produce the highest accuracy.
On the other hand, models such as LongFormer and BigBird, which had high accuracy on the ECHR Violation Dataset, have lower F1 scores than the other models. This may be because LongFormer and BigBird are specialized for long texts, and their attention methods, such as global attention and random attention, may have hurt their F1 scores on these shorter inputs.
These are the experimental results for the Overruling Task Dataset.
Discussion
In this study, we conducted experiments on the legal document classification task using two datasets. Based on the results, this chapter discusses the following two points in particular.
- Is pre-training on texts with domain knowledge effective in classifying legal documents?
- How can longer texts be handled by BERT-based models when classifying legal documents?
First, the first point: pre-training on domain texts appears to be effective. The results tables show that, among the models based on general BERT and RoBERTa (that is, excluding BigBird and LongFormer), the models pre-trained on legal documents (ECHR-Legal-BERT and Harvard-Law-BERT) are the most accurate on both the ECHR Violation Dataset and the Overruling Task Dataset.
It is therefore fair to say that pre-training on texts with domain knowledge is effective in classifying legal documents. However, there are cases where it is not possible to collect a sufficient amount of "textual data containing domain knowledge" to pre-train BERT. In such cases, using a model pre-trained on generic text data is a good option.
Next, let's discuss the second point. For texts longer than 512 tokens, it is effective to use models such as LongFormer or BigBird. The models and methods tested in this study fall into three categories:
- Models with modified attention that can process texts of more than 512 tokens, such as LongFormer and BigBird (pre-trained on generic texts)
- General BERT and RoBERTa (pre-trained on legal documents) + methods for processing long texts
- General BERT and RoBERTa (pre-trained on generic texts) + methods for processing long texts
Their performance ranks in the order listed above, from highest to lowest.
This suggests that when processing texts longer than 512 tokens, it is effective to use models such as LongFormer and BigBird, regardless of the data used for pre-training.
And in the case of "general BERT and RoBERTa models + methods for applying general BERT and RoBERTa models to long sentences", the results show that using MaxPool and MeanPool as methods for processing long sentences gives the highest F1 score.
From the above, it is most desirable to use models such as BigBird and LongFormer when processing law-related texts longer than 512 tokens. As the next best option, when applying general BERT and RoBERTa, MaxPool and MeanPool are considered the best methods to use.
Summary
That concludes the explanation of the paper. The paper clarified "the importance of pre-training with text data containing expert knowledge" and "the effectiveness of methods such as BigBird and LongFormer" in classifying legal documents using BERT.
Pre-training requires huge amounts of text data and computing resources, so it is very costly to do at the individual level. However, even taking that cost into account, this research is a reminder of the importance of pre-training on text data that contains domain knowledge of the task to be tackled.
We hope that pre-trained models with expertise in various fields will be developed and BERT will be applied in various fields in the future.