
Will It Be A Breakthrough For Transformer Scale Up? Introduction Of The Highly Efficient Reformer

3 main points
✔️ Dramatically reduces the cost of attention from O(n^2) to O(n log n)
✔️ Dramatically reduces the memory used for activations and other intermediate results
✔️ Significantly improves implementation efficiency in both speed and memory while maintaining Transformer performance

Reformer: The Efficient Transformer
written by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
(Submitted on 13 Jan 2020 (v1), last revised 18 Feb 2020 (this version, v2))

Comments: Accepted at ICLR 2020
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
  

Transformer's record-breaking results are supported by ever-larger models

Research using Transformers is producing state-of-the-art results one after another, but the models keep growing larger and larger. This trend toward larger scale is making it increasingly difficult to conduct research outside of large research institutions, a problem that is currently being pointed out and discussed.
 
So how big is it getting?
  • It takes 2GB of memory (32-bit floating point) to hold the 0.5B (= 500,000,000) parameters of a single Transformer (Trm) layer.
  • When the token length is 64K, the embedding size is 1,024, and the batch size is 8, the activations (the results of forward propagation) also amount to 64K x 1K x 8 = 0.5B values, i.e. another 2GB.
Sizes like these can already be fatal for the actual computation.
 
  • If the Trm has 12 layers, the activations come to 2GB x 12 = 24GB, all of which must be held during training until backpropagation.
  • Attention is O(L^2) in both computation and memory for a token length L. Even with a batch size of 1, if L = 64K, that is 64K^2 x 4 bytes (32-bit floating point) = 16GB.
At this scale, a machine with only a few GPUs cannot handle the computation at all. The quick calculation below makes these numbers concrete.
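To make these figures concrete, here is a minimal back-of-envelope sketch in plain Python. The sizes (0.5B parameters, 64K tokens, embedding size 1,024, batch size 8, 12 layers) are the ones quoted above; the script is just illustrative arithmetic, not code from the Reformer implementation.

```python
# Back-of-envelope memory estimates for a large Transformer,
# using the sizes quoted above (32-bit floats, 1 GB = 2**30 bytes).

BYTES_PER_FLOAT32 = 4
GB = 2 ** 30

# A single Transformer layer with 0.5B parameters
params = 500_000_000
print(f"parameters (1 layer):    {params * BYTES_PER_FLOAT32 / GB:.1f} GB")       # ~2 GB

# Activations: token length 64K, embedding size 1,024, batch size 8
seq_len, d_model, batch = 64 * 1024, 1024, 8
acts = seq_len * d_model * batch
print(f"activations (1 layer):   {acts * BYTES_PER_FLOAT32 / GB:.1f} GB")         # ~2 GB

# With 12 layers, every layer's activations are kept until backpropagation
print(f"activations (12 layers): {12 * acts * BYTES_PER_FLOAT32 / GB:.1f} GB")    # ~24 GB

# Full attention needs an L x L score matrix, even at batch size 1
print(f"attention matrix:        {seq_len ** 2 * BYTES_PER_FLOAT32 / GB:.1f} GB") # 16 GB
```

Reformer's main ideas target exactly these two terms: the O(L^2) attention matrix and the per-layer activations that must be stored for backpropagation.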
 
Let's take a look at some concrete examples of such large-scale applications.
