Longformer: Improved Version Of The Transformer That Can Handle Longer Sequences

BERT 04/09/2023

3 main points
✔️ Presented a solution to the problem of efficient processing of long sequences
✔️ Reduced the Transformer's computational complexity using three attention patterns: Sliding Window Attention, Dilated Sliding Window Attention, and Global Attention
✔️ Improved accuracy for tasks with long sentences as input

Longformer: The Long-Document Transformer
written by Iz Beltagy, Matthew E. Peters, Arman Cohan
(Submitted on 10 Apr 2020 (v1), last revised 2 Dec 2020 (this version, v2))
Comments: Version 2 introduces the Longformer-Encoder-Decoder (LED) model
Subjects: Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

Longformer addresses the problem that the Transformer's self-attention has $O(n^2)$ computational complexity, so computation time and memory usage grow quadratically as the input gets longer.

What is the problem with Transformer?

The computational complexity of a Transformer increases quadratically with the length of the input sequence. For long inputs, this results in very long computation times and high memory usage.

The cause is Scaled Dot-Product Self-Attention, the main component of the Transformer. Scaled Dot-Product Self-Attention calculates attention from a query and key-value pairs. In other words, it uses the query, key, and value ($Q, K, V$) to compute

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The product of the query and the key, $QK^T$, has as many elements as the square of the document length ($n$). Therefore, given 2,048 tokens as input, the matrix used in the attention calculation is 2,048 × 2,048, which means that a matrix with approximately 4.2 million elements must be processed.
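The quadratic cost can be seen in a minimal pure-Python sketch of scaled dot-product attention. The function names and the toy values of `Q`, `K`, and `V` are illustrative assumptions and are not taken from the paper or its code.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Full self-attention on plain lists. Q and K are n x d_k, V is n x d_v."""
    d_k = len(K[0])
    # One score per (query, key) pair: this n x n matrix is the quadratic cost.
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        for q in Q
    ]
    weights = [softmax(row) for row in scores]
    # Each output row is a weighted average of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
        for w_row in weights
    ]
    return output, weights


# Toy input (made-up numbers): n = 3 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = scaled_dot_product_attention(Q, K, V)

print(len(weights), len(weights[0]))  # 3 3
print(2048 * 2048)  # 4194304
```

The weight matrix always has one row and one column per token, so doubling the input length quadruples the number of entries.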

Once the batch size is taken into account, the amount of computation is enormous, and the usable sequence length is limited by the available memory. The Longformer proposed in this paper addresses the problem of the Transformer's computational complexity increasing with the square of the input sequence length.

What is Longformer?

In Scaled Dot-Product Self-Attention, the amount of computation and memory usage is $O(n^2)$ because attention is directed from all words to all words. The authors therefore propose a mechanism that, as much as possible, directs attention only between important words.

In fact, the paper reports that computation time and memory usage are considerably reduced compared with full self-attention.

Attention Mechanisms in Longformer

This section describes the attention mechanisms used in Longformer. Specifically, the paper proposes three attention patterns, which are compared with full attention below.

(a) Full $n^2$ attention

This is the Scaled Dot-Product Self-Attention used in the original Transformer: attention is directed from all words to all words.

(b) Sliding window attention

Sliding window attention directs attention only to words near the current word. Both memory usage and computation are determined by the number of words to which attention is directed. With a document of length $n$ and a window of size $w$, both are $O(nw)$, which is linear in the document length.
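As a rough illustration, the pattern can be written as a boolean mask in which each token attends to `w // 2` tokens on each side of itself. The helper names below are hypothetical, and a real implementation would avoid materializing the full $n \times n$ mask.

```python
def sliding_window_mask(n, w):
    """mask[i][j] is True when token i may attend to token j."""
    half = w // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


def count_attended(mask):
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


mask = sliding_window_mask(n=8, w=4)
print(count_attended(mask))  # 34
print(8 * 8)                 # 64 pairs for full attention
```

Each row contains at most `w + 1` allowed positions, so the total number of pairs is bounded by $n(w + 1)$ rather than $n^2$.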

The window size $w$ is set differently for each layer. Specifically, $w$ is smaller in the lower layers and larger in the upper layers. This allows:

- the lower layers to collect local information
- the upper layers to collect overall information

The effect is similar to that of

(c) Dilated sliding window

Dilated sliding window is a method that directs attention not only to nearby words but also to words at a moderate distance, without increasing computational complexity, by leaving gaps at regular intervals in the window. In the paper, the gap size is expressed as the dilation $d$. The paper also notes that performance can be improved by using a different value of $d$ for each head of multi-head attention.

However, dilated sliding windows are not used in the lower layers, which collect local information. Dilation is applied to only two heads in the upper layers.
This choice is based on the paper's ablation experiments, which show that increasing the window size from the lower layers to the upper layers improves accuracy, and that applying dilation to two heads in the upper layers gives better accuracy than using no dilation.
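The dilated pattern can be sketched as a mask in the same way. The convention here is an assumption made for illustration: attended positions are spaced `d` apart, so `d = 1` reduces to the ordinary sliding window. The function names are hypothetical.

```python
def dilated_sliding_window_mask(n, w, d):
    """Token i attends to positions i + k * d for k in [-w // 2, w // 2]."""
    half = w // 2
    return [
        [abs(i - j) % d == 0 and abs(i - j) // d <= half for j in range(n)]
        for i in range(n)
    ]


def attended_positions(mask, i):
    """Positions that token i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


plain = dilated_sliding_window_mask(n=16, w=4, d=1)
dilated = dilated_sliding_window_mask(n=16, w=4, d=3)
print(attended_positions(plain, 8))    # [6, 7, 8, 9, 10]
print(attended_positions(dilated, 8))  # [2, 5, 8, 11, 14]
```

Both rows contain five positions, so the cost per token is unchanged, but the dilated row reaches six positions away instead of two.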

(d) Global+sliding window

This is a combination of Global Attention and sliding window attention. With Global Attention, a selected word attends to all words, and all words attend to that selected word.

For example, in BERT the special token [CLS] is used to classify a document. The idea is that [CLS] should attend to all words, and all other words should attend to [CLS]. This allows [CLS] to obtain the characteristics of the whole document when classifying it. Because Global Attention applies only to a small number of specific word positions, its computational complexity is $O(n)$.
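A minimal sketch of the combined pattern follows, assuming position 0 holds a [CLS]-like token that is given global attention. The function name and the choice of position 0 are illustrative assumptions.

```python
def global_sliding_window_mask(n, w, global_positions):
    """Sliding window mask plus symmetric global attention for chosen positions."""
    half = w // 2
    g = set(global_positions)
    return [
        # Allowed if local, or if either side of the pair is a global token.
        [abs(i - j) <= half or i in g or j in g for j in range(n)]
        for i in range(n)
    ]


mask = global_sliding_window_mask(n=8, w=2, global_positions=[0])
print(all(mask[0]))                  # True: position 0 attends to every token
print(all(row[0] for row in mask))   # True: every token attends to position 0
print(mask[5][2])                    # False: outside the window, neither is global
```

Each global position adds one full row and one full column to the mask, which is why the extra cost grows linearly with $n$ when the number of global positions is small.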

The paper intends Global Attention to be used together with sliding window attention.

In addition, the model does not share a single set of $(Q, K, V)$ projections between the two, as a standard Transformer would. Instead, it uses separate linear projections for sliding window attention and for Global Attention. This allows the model to adapt flexibly to each type of attention, and thus perform better in downstream tasks.

Improved accuracy per task

Training of models for evaluation

When training a model for evaluation, the authors divide the training into five phases. Ideally, a model would be trained with the largest window size and sequence length that fit in GPU memory. However, this is not practical for long contexts. In addition, the early stages of training require many gradient updates, so training is more efficient when it starts with short sequences and small window sizes and moves to longer sequences, larger windows, and lower learning rates in later phases. Therefore, in this study, training is divided into five phases, with a starting sequence length of 2,048 and a final sequence length of 23,040.
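The staged schedule can be sketched as follows, assuming (as the paper describes) that each phase doubles the sequence length and window size and halves the learning rate. Only the sequence lengths 2,048 and 23,040 come from the text above. The starting window size and learning rate are placeholder values, and `staged_schedule` is a hypothetical helper.

```python
def staged_schedule(start_seq_len, max_seq_len, start_window, start_lr, num_phases):
    """Build one settings dict per phase: double lengths, halve the learning rate."""
    phases = []
    seq_len, window, lr = start_seq_len, start_window, start_lr
    for phase in range(1, num_phases + 1):
        phases.append({
            "phase": phase,
            "seq_len": min(seq_len, max_seq_len),  # cap at the final length
            "window": window,
            "lr": lr,
        })
        seq_len *= 2
        window *= 2
        lr /= 2
    return phases


phases = staged_schedule(
    start_seq_len=2048,
    max_seq_len=23040,
    start_window=32,   # placeholder, not from the paper
    start_lr=1e-3,     # placeholder, not from the paper
    num_phases=5,
)
print([p["seq_len"] for p in phases])  # [2048, 4096, 8192, 16384, 23040]
```

Keeping the longest and slowest phase for the end means most gradient updates happen on short, cheap sequences.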

The following table details the settings of each parameter at each stage of the training of the model that achieved the best performance.

Training Results

The training results show improved performance over the original Transformer and over Reformer, another model that attempts to improve on the Transformer.

Verification with BERT

A notable contribution is the task-by-task accuracy comparison against BERT-style models, which was not done for Reformer or similar models. According to this comparison, Longformer achieves high accuracy in almost all tasks.

The verification uses models fine-tuned from RoBERTa. However, RoBERTa has only 512 position embeddings, so the 512 position embeddings are copied 8 times to cover 4,096 positions. Despite its simplicity, this initialization is reported to be very effective, apparently because the copying preserves the local structure everywhere except at the partition boundaries.
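The copying step can be sketched on a toy embedding table as shown below. The table values, the embedding width of 4, and the function name are made up for illustration, and details of the real RoBERTa checkpoint layout are ignored.

```python
def extend_position_embeddings(embeddings, target_len):
    """Repeat a position-embedding table until it covers target_len positions."""
    source_len = len(embeddings)
    return [list(embeddings[pos % source_len]) for pos in range(target_len)]


# Toy table: 512 positions, each a 4-dimensional vector (made-up values).
base = [[float(pos)] * 4 for pos in range(512)]
extended = extend_position_embeddings(base, 4096)

print(len(extended))                  # 4096
print(extended[512] == base[0])       # True: the second copy starts here
print(extended[4095] == base[511])    # True: the eighth copy ends here
```

Within each copy, neighboring positions keep the same relative embeddings as in the original table. Only the seams between copies (positions 511 to 512, 1023 to 1024, and so on) differ from what the original model saw.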

Results of verification in BERT

The evaluation of each task in the BERT-related model is as follows

The tasks are as follows.

WikiHop:
A task to read multiple documents and link information together to answer a question.

TriviaQA:
A dataset in which an instance consists of a question, an answer, and a rationale.

HotpotQA:
Task to read multiple documents and connect the information to answer the questions.

OntoNotes (Coreference Resolution):
A task to find two or more words that refer to the same object and map them to each other.

IMDB:
Classification task on relatively short documents in a movie review dataset

Hyperpartisan:
Classification task on relatively long documents in a news dataset

Longformer performs well on all of these datasets, which demonstrates its usefulness.

Summary

Longformer is a scalable Transformer-based model for processing long documents. It can perform a wide range of document-level NLP tasks without chunking or shortening long inputs and without complex architectures for combining information across chunks. Longformer combines local and global information and scales linearly with sequence length by using three attention patterns (Sliding Window Attention, Dilated Sliding Window Attention, and Global Attention). Longformer also achieved the best performance on the text8 and enwik8 tasks. In addition, the pretrained Longformer consistently outperformed RoBERTa on long document tasks and achieved the best performance on WikiHop and TriviaQA.