Will It Be a Breakthrough for Scaling Up Transformers? Introducing the Highly Efficient Reformer
3 main points
✔️ Dramatically reduces Attention computation from O(n^2) to O(n log n) (via locality-sensitive hashing; see the sketch after these points)
✔️ Dramatically reduces the memory needed to store activations and other intermediate values
✔️ Significantly improves efficiency in both speed and memory while maintaining the Transformer's performance
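The O(n log n) figure in the first point comes from replacing full dot-product attention with locality-sensitive-hashing (LSH) attention, which only compares queries and keys that hash into the same bucket. Below is a minimal, illustrative numpy sketch of that bucketing idea; it is not the paper's implementation, which also shares the Q/K projection, sorts and chunks the buckets, and uses multiple hash rounds.

```python
import numpy as np

def lsh_bucket_attention(q, k, v, n_buckets=16, seed=0):
    """Toy sketch: hash queries/keys with random rotations, then attend
    only within each bucket instead of over all L x L pairs."""
    rng = np.random.default_rng(seed)
    d = q.shape[-1]
    # Random-rotation hashing: bucket id = argmax over [xR ; -xR] directions.
    rotations = rng.normal(size=(d, n_buckets // 2))

    def bucket(x):
        rotated = x @ rotations
        return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

    qb, kb = bucket(q), bucket(k)
    out = np.zeros_like(v)
    for b in range(n_buckets):
        qi, ki = np.where(qb == b)[0], np.where(kb == b)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        # Only a small bucket-local score matrix, never the full L x L one.
        scores = q[qi] @ k[ki].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[qi] = weights @ v[ki]
    return out

# Example: 1,024 tokens, 64-dim heads; each bucket's score matrix is far
# smaller than the full 1,024 x 1,024 attention matrix.
q = np.random.randn(1024, 64)
k = np.random.randn(1024, 64)
v = np.random.randn(1024, 64)
_ = lsh_bucket_attention(q, k, v)
```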
Reformer: The Efficient Transformer
written by Anonymous
(Submitted on 13 Jan 2020 (v1), last revised 18 Feb 2020 (this version, v2))
Comments: Accepted at ICLR 2020
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Transformer's record-breaking results are supported by large-scale models
- Holding the 0.5B (= 500,000,000) parameters of a single Transformer layer (Trm) already takes 2GB of memory at 32-bit floating point.
- With a token length of 64,000, an embedding size of 1,024, and a batch size of 8, the activations (the results of the forward pass) also come to 64K x 1K x 8 = 0.5B values, i.e. another 2GB.
- If the Trm has 12 layers, the activations amount to 2GB x 12 = 24GB, all of which must be kept during training until backpropagation.
- Attention is O(L^2) in both computation and memory for token length L. Even with a batch size of 1, L = 64K means 64K^2 x 4 bytes (32-bit floating point) = 16GB, as the back-of-the-envelope sketch below confirms.
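The figures above can be checked with a quick back-of-the-envelope calculation (a minimal Python sketch, assuming 4 bytes per fp32 value; the rounded numbers in the bullets come out slightly lower when 1GB = 2^30 bytes):

```python
GB = 2 ** 30     # bytes per gigabyte
FP32 = 4         # bytes per 32-bit float

# One Transformer layer with 0.5B parameters
params = 500_000_000
print(params * FP32 / GB)            # ~1.9 GB ("2GB" above)

# Activations: token length 64K, embedding size 1K, batch size 8
seq_len, d_model, batch = 64_000, 1_024, 8
act = seq_len * d_model * batch      # ~0.5B values per layer
print(act * FP32 / GB)               # ~2.0 GB per layer
print(12 * act * FP32 / GB)          # ~23.4 GB for 12 layers ("24GB" above)

# Full attention stores an L x L score matrix, even at batch size 1
attn = 64_000 ** 2
print(attn * FP32 / GB)              # ~15.3 GB ("16GB" above)
```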