Will It Be a Breakthrough for Scaling Up Transformers? Introducing the Highly Efficient Reformer
3 main points
✔️ Dramatically reduces Attention computation from O(n^2) to O(n log n) (via locality-sensitive hashing; see the sketch after these points)
✔️ Dramatically reduces the memory needed to store activations and other intermediate values
✔️ Significantly improves efficiency in both speed and memory while maintaining the Transformer's performance
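The O(n log n) figure in the first point comes from replacing full dot-product attention with locality-sensitive-hashing (LSH) attention, which only compares queries and keys that hash into the same bucket. Below is a minimal, illustrative numpy sketch of that bucketing idea; it is not the paper's implementation, which also shares the Q/K projection, sorts and chunks the buckets, and uses multiple hash rounds.

```python
import numpy as np

def lsh_bucket_attention(q, k, v, n_buckets=16, seed=0):
    """Toy sketch: hash queries/keys with random rotations, then attend
    only within each bucket instead of over all L x L pairs."""
    rng = np.random.default_rng(seed)
    d = q.shape[-1]
    # Random-rotation hashing: bucket id = argmax over [xR ; -xR] directions.
    rotations = rng.normal(size=(d, n_buckets // 2))

    def bucket(x):
        rotated = x @ rotations
        return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

    qb, kb = bucket(q), bucket(k)
    out = np.zeros_like(v)
    for b in range(n_buckets):
        qi, ki = np.where(qb == b)[0], np.where(kb == b)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        # Only a small bucket-local score matrix, never the full L x L one.
        scores = q[qi] @ k[ki].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[qi] = weights @ v[ki]
    return out

# Example: 1,024 tokens, 64-dim heads; each bucket's score matrix is far
# smaller than the full 1,024 x 1,024 attention matrix.
q = np.random.randn(1024, 64)
k = np.random.randn(1024, 64)
v = np.random.randn(1024, 64)
_ = lsh_bucket_attention(q, k, v)
```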
Reformer: The Efficient Transformer
written by Anonymous
(Submitted on 13 Jan 2020 (v1), last revised 18 Feb 2020 (this version, v2))
Comments: Accepted at ICLR 2020
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
Transformer's record-breaking results are supported by large-scale models
- Holding the 0.5B (= 500,000,000) parameters of a single Transformer layer (Trm) already takes 2GB of memory at 32-bit floating point.
- With a token length of 64,000, an embedding size of 1,024, and a batch size of 8, the activations (the results of the forward pass) also come to 64K x 1K x 8 = 0.5B values, i.e. another 2GB.
- If the Trm has 12 layers, the activations amount to 2GB x 12 = 24GB, all of which must be kept during training until backpropagation.
- Attention is O(L^2) in both computation and memory for token length L. Even with a batch size of 1, L = 64K means 64K^2 x 4 bytes (32-bit floating point) = 16GB, as the back-of-the-envelope sketch below confirms.
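The figures above can be checked with a quick back-of-the-envelope calculation (a minimal Python sketch, assuming 4 bytes per fp32 value; the rounded numbers in the bullets come out slightly lower when 1GB = 2^30 bytes):

```python
GB = 2 ** 30     # bytes per gigabyte
FP32 = 4         # bytes per 32-bit float

# One Transformer layer with 0.5B parameters
params = 500_000_000
print(params * FP32 / GB)            # ~1.9 GB ("2GB" above)

# Activations: token length 64K, embedding size 1K, batch size 8
seq_len, d_model, batch = 64_000, 1_024, 8
act = seq_len * d_model * batch      # ~0.5B values per layer
print(act * FP32 / GB)               # ~2.0 GB per layer
print(12 * act * FP32 / GB)          # ~23.4 GB for 12 layers ("24GB" above)

# Full attention stores an L x L score matrix, even at batch size 1
attn = 64_000 ** 2
print(attn * FP32 / GB)              # ~15.3 GB ("16GB" above)
```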