"Long Range Arena", a Benchmark Dedicated to Efficient Transformers, Is Now Available!

Transformer 25/02/2021

3 main points.
✔️ Proposed "Long Range Arena" benchmark for Efficient Transformer
✔️ Covers tasks consisting of long sequences across various modalities
✔️ Compares and evaluates 10 of the various models proposed in the past

High-Performance Large-Scale Image Recognition Without Normalization
written by Andrew Brock, Soham De, Samuel L. Smith, Karen Simonyan
(Submitted on 11 Feb 2021)
Comments: Accepted to arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)

Introduction

The computational complexity of Self-Attention is a major challenge in using the Transformer, especially for long sequences. In previous summaries (1, 2, 3), we discussed many research efforts to reduce the computational complexity of the Transformer.

However, as those articles show, there has been no common benchmark for comparing these improved Transformer variants. As a result, even though many improvements were proposed, information that matters in practice, such as the characteristics, properties, and effectiveness of each model, was missing.

To provide this information, this article introduces the "Long Range Arena", a benchmark for evaluating improved Transformer methods (Efficient Transformers).

This benchmark includes tasks across various modalities, with long sequences ranging from 1,000 to 16,000 tokens. We also present the results of comparing and evaluating ten representative models, chosen from the models covered in previous articles, on this benchmark.

Long-Range Arena (LRA)

To serve as a benchmark for a wide range of Transformers, LRA aims to meet the following six requirements (desiderata):

Generality: The task should be one to which all Efficient Transformer models can be applied (i.e., the task can be solved with an encoder alone).
Simplicity: The task should be simple, and factors that make it difficult to compare models, including pre-training, should be eliminated.
Challenging: The task must be sufficiently challenging for current models.
Long inputs: The input sequence should be reasonably long in order to assess whether a model can capture long-range dependencies.
Probing diverse aspects: The set of tasks should make it possible to evaluate the various capabilities of a model.
Non-resource intensive and accessible: The benchmark should not require a large amount of computing resources.

The next section describes the six tasks included in the LRA.

1.LONG LISTOPS

This task focuses on the ability to capture long-range dependencies in input sequences. It is a longer-sequence version of the standard ListOps task and is designed to investigate the parsing capabilities of neural network models. An example task is shown below.

Each input consists of a hierarchical structure of operators (MAX, MEAN, MEDIAN, SUM_MOD) and operands enclosed in brackets. The sequence length is up to $2K$.

It is a 10-way classification task where the output is one of the digits 0 to 9. It is quite challenging because the model must take into account every token and the logical structure of the entire input sequence.
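To make the task format concrete, here is a minimal sketch of how a ListOps-style expression could be evaluated to obtain its label. The function name `evaluate_listops`, the space-separated token format, and the integer rounding used for MEAN and MEDIAN are assumptions for illustration, not details taken from the benchmark itself.

```python
import statistics

# Operators named in the text. Integer rounding for MEAN and MEDIAN is an
# assumption made so that every result stays in the 0-9 label range.
OPS = {
    "MAX": max,
    "MEAN": lambda xs: sum(xs) // len(xs),
    "MEDIAN": lambda xs: int(statistics.median(xs)),
    "SUM_MOD": lambda xs: sum(xs) % 10,
}


def evaluate_listops(expression):
    """Evaluate a bracketed ListOps-style expression and return its label."""
    tokens = expression.replace("[", " [ ").replace("]", " ] ").split()
    stack = []  # one frame per open bracket: [operator, operand, ...]
    position = 0
    while position < len(tokens):
        token = tokens[position]
        if token == "[":
            position += 1
            stack.append([tokens[position]])  # token after "[" is the operator
        elif token == "]":
            operator, *operands = stack.pop()
            value = OPS[operator](operands)
            if not stack:
                return value  # outermost bracket closed
            stack[-1].append(value)
        else:
            stack[-1].append(int(token))
        position += 1
    raise ValueError("unbalanced brackets")


print(evaluate_listops("[MAX 4 3 [SUM_MOD 2 3 ] 1 0 [MEDIAN 1 5 8 9 2 ] ]"))  # prints 5
```

The outer MAX cannot be resolved without first resolving the nested sub-expressions, which is why a model has to follow the full hierarchy rather than only nearby tokens.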

2.BYTE-LEVEL TEXT CLASSIFICATION

This task differs from normal text classification (where a sequence of words, etc. is given as input) in that it is a byte/character level text classification task. The byte-level setting is also very different from character-by-character language modeling.

For example, in character-by-character language modeling, given the word "appl," we might expect it to be followed by an "e." Byte-level text classification, on the other hand, is a much more difficult task, and cannot be solved by simply capturing the nearby context.

For the dataset, IMDb reviews, a commonly used text classification benchmark, are used with a maximum sequence length of 4K. It is a binary classification task, and accuracy is used as the metric.
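As a rough illustration of what "byte-level" input means, the sketch below turns a review into a fixed-length sequence of byte IDs. The function name `encode_bytes`, the choice of 4096 for "4K", and the use of ID 0 for padding are assumptions for this example, not the benchmark's actual preprocessing.

```python
from typing import List


def encode_bytes(text: str, max_len: int = 4096, pad_id: int = 0) -> List[int]:
    """Encode text as byte IDs, truncated or padded to max_len.

    Each UTF-8 byte is shifted by 1 so that pad_id=0 never collides with
    a real byte (an assumption made for this sketch).
    """
    ids = [byte + 1 for byte in text.encode("utf-8")][:max_len]
    return ids + [pad_id] * (max_len - len(ids))


sequence = encode_bytes("This movie was great!")
print(len(sequence))   # prints 4096
print(sequence[:4])    # prints [85, 105, 106, 116]
```

With no word or subword vocabulary, the model sees only individual characters, so it has to compose them into words and phrases over thousands of positions before it can judge sentiment.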

3.BYTE-LEVEL DOCUMENT RETRIEVAL

Like the text classification task, this task operates at the byte/character level, but it asks for a similarity score between two documents. It aims to measure the ability to compress long sequences into a representation suitable for similarity-based matching.

The ACL Anthology Network (AAN) is used as the dataset. Each of the two documents has a sequence length of 4K, for a total text length of 8K. This is a binary classification task, and accuracy is used as the metric.

4.IMAGE CLASSIFICATION ON SEQUENCES OF PIXELS

This task is an image classification task for $N×N$ images converted to a sequence of pixels of length $N^2$. It focuses on the ability to learn relationships in 2D image space from a one-dimensional pixel sequence (additional modules such as CNNs are not allowed).

For simplicity, the input image is converted to grayscale with 8 bits per pixel, and CIFAR-10 is used as the dataset.
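The conversion from an image to a pixel sequence can be sketched as follows. The function name `to_gray_sequence` and the luma weights used for the grayscale conversion are assumptions for illustration; the text only states that images are converted to 8-bit grayscale and flattened.

```python
def to_gray_sequence(image):
    """Flatten an N x N RGB image into a length N*N list of 8-bit gray values.

    `image` is a nested list of rows, each row a list of (r, g, b) tuples
    with values in 0..255. The 0.299/0.587/0.114 weights are a common
    grayscale convention, assumed here rather than taken from the paper.
    """
    sequence = []
    for row in image:  # row-major order: left to right, top to bottom
        for r, g, b in row:
            sequence.append(round(0.299 * r + 0.587 * g + 0.114 * b))
    return sequence


# A CIFAR-10-sized dummy image (32 x 32) becomes a sequence of length 1024.
dummy = [[(0, 0, 0)] * 32 for _ in range(32)]
print(len(to_gray_sequence(dummy)))  # prints 1024
```

After flattening, two pixels that are vertical neighbours in the image end up N positions apart in the sequence, which is the 2D structure the model has to recover on its own.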

5.PATHFINDER (LONG-RANGE SPATIAL DEPENDENCY)

The PATHFINDER task tests the ability to learn long-range spatial dependencies. The model must decide whether two points are connected by a dashed line, as shown in the following figure.

Images are treated as sequences of pixels. In this task, the image is $32×32$ and the sequence length is 1024.

6.PATHFINDER-X (LONG-RANGE SPATIAL DEPENDENCIES WITH EXTREME LENGTHS)

This is a version of the aforementioned PATHFINDER task with a sequence length of 16K ($128×128$ images). Although the sequence length is much larger than in the normal case (sequence length 1024), the task itself is essentially unchanged. The purpose of this task is to see whether the difficulty changes significantly when only the sequence length is increased.
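The sequence lengths quoted for the two Pathfinder variants follow directly from flattening the images. The short sketch below also counts query-key pairs for full self-attention, as a rough indication of why the longer variant is so much more expensive for a standard Transformer; the function names are hypothetical.

```python
def sequence_length(side):
    """Number of pixels (tokens) in a side x side image."""
    return side * side


def attention_pairs(seq_len):
    """Query-key pairs scored by full (quadratic) self-attention."""
    return seq_len * seq_len


pathfinder = sequence_length(32)     # 1024 tokens
path_x = sequence_length(128)        # 16384 tokens, i.e. "16K"
print(pathfinder, path_x)            # prints 1024 16384
print(attention_pairs(path_x) // attention_pairs(pathfinder))  # prints 256
```

A 16-fold increase in sequence length therefore means 256 times as many attention scores per layer for full self-attention.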

Required Attention Span

The main goal of the LRA benchmark is to assess the ability of the Efficient Transformer model to capture long-range dependencies. Here, by defining a metric called required attention span, we quantitatively estimate the long-range dependencies that need to be captured for each task.

In other words, it indicates how far across the sequence a model must be able to attend in order to solve the task. (Given a trained attention-based model and a set of input sequences, the metric is computed as the mean distance between the query token and the attended tokens, weighted by the attention weights.)
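A minimal sketch of this computation for a single attention matrix is given below. The function name `required_attention_span` is hypothetical, and the sketch assumes one attention head and one input sequence; averaging over heads, layers, and samples is omitted.

```python
def required_attention_span(attention):
    """Mean attention-weighted distance between query and attended tokens.

    `attention[i][j]` is the weight query position i places on position j;
    each row is assumed to sum to 1.
    """
    n = len(attention)
    total = 0.0
    for i, row in enumerate(attention):
        # Expected distance |i - j| under query i's attention distribution.
        total += sum(weight * abs(i - j) for j, weight in enumerate(row))
    return total / n


# Purely local attention (each token attends only to itself) has span 0.
local = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Uniform attention spreads weight over all positions, so the span is larger.
uniform = [[1 / 3] * 3 for _ in range(3)]
print(required_attention_span(local))               # prints 0.0
print(round(required_attention_span(uniform), 3))   # prints 0.889
```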

The results of the comparison of each task by this metric are presented in the following figure.

A large value of this metric indicates that the model must be highly capable of capturing long-range dependencies; handling local information well is not enough.

Experiment

Model

The models evaluated in the experiment are as follows.

For explanations of these models, please see the previous explanatory articles (1, 2, 3).

Task Performance Comparison

The results for the various architectures on the Long Range Arena benchmark are as follows.

(Although the models were evaluated as fairly as possible, the optimal hyperparameters may differ from model to model, so these results should not be read as a definitive judgment of which model is best.)

About the result of ListOps

For the ListOps task, the best model has an accuracy of 37%, indicating that it is a rather difficult task. Since this is a 10-way classification task, random guessing would give an accuracy of 10%, so the models do learn the task to some extent. Because ListOps data is hierarchical, performance on this task may reflect a model's ability to handle hierarchical structures.

For example, kernel-based models (e.g., Performer, Linear Transformers) perform poorly and may not be suitable for dealing with hierarchical structures.

About the result of Text Classification

In contrast to ListOps, the kernel-based models perform better on this task. This likely reflects differences in what each type of model is suited to.

About the result of Retrieval

It turns out to be a daunting task, with even the best models achieving an accuracy of less than 60%.

The best performing models are Sparse Transformer and BigBird, with models consisting of fixed attention patterns showing relatively better results, and low-rank factorization and kernel-based models showing relatively poorer results.

About the result of Image Classification

Overall, the variance in performance across models is small for this task. Linformer and Reformer are relatively inferior, while Sparse Transformer and Performer are relatively superior.

It has also been observed that overfitting to the training set occurred in this task, making it difficult to generalize on the test set.

About the result of Pathfinder / Path-X

For the normal Pathfinder task, all models achieved reasonable performance. The average accuracy was 72%, with the kernel-based models (Performer and Linear Transformer) performing particularly well. For Path-X, with its very long sequences, all models failed to learn (about 50%, the same as random guessing). Even though the task is essentially the same as Pathfinder, solving it becomes significantly more difficult as the sequence length increases.

Efficiency Comparison

Next, as a comparison of the efficiency of each model, the training execution time and memory usage for different sequence lengths are shown below.

The benchmark is run on 4x4 TPU v3 chips and reports the number of steps per second with a batch size of 32 (the ranking may change depending on the hardware used).

About training speed

The low-rank factorization and kernel-based models were particularly fast, with Performer being the fastest. At a sequence length of 4K, it was 5.7 times faster than the regular Transformer.

Reformer, on the other hand, is consistently slower than the regular Transformer at all sequence lengths.

About memory usage

Linformer has the lowest memory usage: at a sequence length of 4K, it uses about 10% of the memory of the regular Transformer (9.48GB -> 0.99GB). As with speed, the kernel-based models (Performer and Linear Transformer) also do relatively well.

We also see that Linformer and Performer do not increase memory usage significantly with increasing sequence length.

Overall result (universal architecture does not exist yet)

In terms of average performance across all tasks, the best performer is BigBird, which consistently performs well across all tasks. The kernel-based models (Performer and Linear Transformer) have lower overall averages due to their poor performance on the ListOps task.

The following figure shows the trade-off between score (y-axis), model speed (x-axis), and memory footprint (circle size).

This figure shows that BigBird achieves better performance than the regular Transformer at almost the same speed. The kernel-based models (Performer and Linear Transformer) combine reasonable performance with good speed. As explained in the individual task results, kernel-based models are not well suited to handling hierarchical structures; each model has its own characteristics.

Therefore, the appropriate model depends on the assumed conditions: whether speed, performance, or memory usage matters most, what kind of task is to be solved, and so on. At least for now, it can be said that there is no universal model.

Summary

In this article, we presented a benchmark for Efficient Transformer, which consists of tasks across various modalities such as text, math, and image data. Through this benchmark, various Transformer models were compared and their characteristics and performance (speed/memory) were shown.

Overall, the benchmark presents previously unclear information, such as the trade-offs among model quality, speed, and memory, in an easy-to-understand manner.

This benchmark will eventually be open-sourced and may become a cornerstone in future Efficient Transformer research.