The Original Transformer Model Still Rules Even After 3 Years!


3 main points
✔️ Survey of the various transformer modifications over the years.
✔️ Comparison of the Vanilla Transformer model with 25 different variants.
✔️ Suggestions to improve research productivity in transformer architecture modifications. 

Do Transformer Modifications Transfer Across Implementations and Applications?
written by Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, Colin Raffel
(Submitted on 23 Feb 2021)
Accepted to arXiv.

Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)


Introduction

Most progress in deep learning comes from incremental improvements in architectures, loss functions, and training techniques, punctuated by sudden leaps in performance. Residual connections, normalization, dropout, and Adam are some of the leaps that improved the state of the art across all domains. From the beginning, small, regular modifications to CNN architectures have helped improve the state of the art in several vision tasks.

However, the transformer, which was proposed in 2017, is still used without many changes to the original model. Transformers have proved effective not just for sequence-to-sequence modeling in NLP tasks; the same architecture also works well for vision tasks (ViT). Although the scientific community has proposed several modifications to activations, normalization, depth, embeddings, and weight sharing that work well in certain tasks, these techniques have not been widely adopted.

In this paper, we investigate how well these modifications transfer across implementations and applications other than the ones they were proposed for, and we conjecture why some methods do not carry over.

Some Key Concepts

This section includes a brief overview of the transformer architecture. For more detailed information read the original paper or this article.

A transformer network is made up of two main parts: the encoder and the decoder. The network takes in a sequence {x[1], x[2], ..., x[T]} and is trained to predict the target labels {y[1], y[2], ..., y[U]}. In the encoder, each token in the input sequence is replaced with its corresponding embedding vector of dimension d_model. Since the self-attention operation is permutation invariant, positional encodings p[t] are added to the embedding vectors.

Each block in the encoder comprises two sub-blocks: a multi-headed self-attention (MHSA) layer followed by a dense feed-forward layer. The input to an MHSA block is passed through three different dense projection layers to obtain key (k), value (v), and query (q) vectors. Each head h in layer l of the network performs the following computation:
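
In the standard formulation from the original Transformer paper, this computation is scaled dot-product attention. The sketch below uses assumed notation that does not come from this article: W^Q, W^K, and W^V are the three projection matrices of head h in layer l, x_t is the layer input at position t, and d_k is the per-head key dimension.

```latex
q_{l,h,t} = x_t W^{Q}_{l,h}, \qquad
k_{l,h,t} = x_t W^{K}_{l,h}, \qquad
v_{l,h,t} = x_t W^{V}_{l,h}

\alpha_{l,h,t,s} =
  \frac{\exp\left( q_{l,h,t} \cdot k_{l,h,s} / \sqrt{d_k} \right)}
       {\sum_{s'=1}^{T} \exp\left( q_{l,h,t} \cdot k_{l,h,s'} / \sqrt{d_k} \right)}

a_{l,h,t} = \sum_{s=1}^{T} \alpha_{l,h,t,s} \, v_{l,h,s}
```

Each output a_{l,h,t} is a weighted average of the value vectors, with weights given by how well the query at position t matches each key.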

The outputs of all heads are concatenated and projected back to dimension d_model. This output is added to the residual input, normalized, and passed through the dense feed-forward network with a ReLU non-linearity.

Both sub-blocks have a residual connection and use layer normalization. Layer normalization is an operation applied to a sequence h[1], h[2], ..., h[T] and defined as:
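
In its standard form, layer normalization rescales each vector h[t] using the mean and standard deviation of its own d_model features. The symbols μ_t and σ_t below are assumed notation for those two statistics; this is a reconstruction of the usual definition rather than a formula taken from this article.

```latex
\mu_t = \frac{1}{d_{model}} \sum_{i=1}^{d_{model}} h_t[i], \qquad
\sigma_t = \sqrt{\frac{1}{d_{model}} \sum_{i=1}^{d_{model}} \left( h_t[i] - \mu_t \right)^2}

\mathrm{LayerNorm}(h_t)[i] = \frac{\gamma_i}{\sigma_t} \left( h_t[i] - \mu_t \right) + \beta_i
```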

Here, γ, β ∈ R^{d_model} are learned parameters unique to each layer normalization layer.

One difference in the decoder is that it uses an attention mask that prevents it from attending to future items in the target sequence. Another difference is that the decoder has an additional attention layer that receives its key and value projection inputs from the encoder output, while its query projection input comes from the masked MHSA layer. The remaining components are similar to the encoder.
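
To make the attention computation and the decoder mask concrete, here is a minimal pure-Python sketch of a single attention head. The function names `softmax` and `attention` and the `causal` flag are illustrative choices, not names from the paper, and the projections that produce the queries, keys, and values are assumed to have been applied already.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention for one head.

    queries, keys, values: lists of vectors (lists of floats), one per position.
    causal=True hides positions s > t from position t, as in the decoder.
    """
    d_k = len(keys[0])
    outputs = []
    for t, q in enumerate(queries):
        scores = []
        for s, k in enumerate(keys):
            if causal and s > t:
                # Future position: a score of -inf becomes a weight of 0.
                scores.append(float("-inf"))
            else:
                dot = sum(qi * ki for qi, ki in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

With `causal=True`, the first position can only attend to itself, so its output is exactly the first value vector. Without the mask, every output mixes information from all positions.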

Modifications to the Original Model

In this section, we look at the various modifications with which we tested the transformer model.


Activations

We tried replacing the ReLU in the feed-forward network with GeLU, Swish, Exponential Linear Units (ELU), SeLU, Sigmoid (σ), Softplus, and Gated Linear Units (GLU). A GLU computes F1(x) ⊙ σ(F2(x)), where F1 and F2 are separate dense layers and ⊙ is the element-wise product. We also tried replacing the sigmoid in a GLU with ReLU (ReGLU), GeLU (GeGLU), Swish (SwiGLU), and a standard linear transformation (LiGLU).
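
The GLU family differs only in the function applied to the gating branch. The following is a minimal sketch, assuming two separate dense layers passed in as `(weight, bias)` pairs. The helper names `linear`, `GATES`, and `gated_linear_unit` are hypothetical and chosen only for this illustration.

```python
import math


def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def gelu(z):
    # Exact GeLU using the Gaussian error function.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z):
    return z * sigmoid(z)


def identity(z):
    return z


# Gate function used by each variant.
GATES = {
    "GLU": sigmoid,
    "ReGLU": relu,
    "GeGLU": gelu,
    "SwiGLU": swish,
    "LiGLU": identity,
}


def gated_linear_unit(x, f1, f2, variant="GLU"):
    """Compute f1(x) * gate(f2(x)) element-wise; f1 and f2 are (weight, bias)."""
    gate = GATES[variant]
    a = linear(x, *f1)
    b = linear(x, *f2)
    return [ai * gate(bi) for ai, bi in zip(a, b)]
```

Because the two branches have their own weights, a GLU-style feed-forward layer has more parameters than a plain dense layer of the same width, which matters when comparing models at a fixed parameter budget.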


Normalization

We tried Root-Mean-Square Normalization (RMSNorm), ReZero initialization, ReZero + LayerNorm, ReZero + RMSNorm, and Fixup initialization.
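
As a rough illustration of two of these ideas: RMSNorm rescales a vector by its root mean square without subtracting the mean, and ReZero scales the sub-block output by a learned scalar that starts at zero. The sketch below assumes a small `eps` for numerical stability; the function names are illustrative only.

```python
import math


def rms_norm(h, gain, eps=1e-8):
    """Divide h by its root mean square, then apply a learned per-feature gain."""
    rms = math.sqrt(sum(v * v for v in h) / len(h) + eps)
    return [g * v / rms for g, v in zip(gain, h)]


def rezero_residual(x, sublayer_output, alpha=0.0):
    """ReZero residual: x + alpha * F(x), with alpha a learned scalar.

    alpha starts at 0.0, so the block initially acts as the identity.
    """
    return [xi + alpha * fi for xi, fi in zip(x, sublayer_output)]
```

Compared with layer normalization, RMSNorm skips the mean computation and the bias term, which is one plausible reason for its lower cost.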


Depth

We explore the tradeoff between the width of the feed-forward sub-block (d_ff) and the depth of the network (L), keeping the total parameter count constant.
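
A quick parameter count shows how depth can be traded for width at a fixed budget. This is a back-of-the-envelope sketch: it ignores biases, normalization parameters, and embeddings, it assumes the attention inner width `d_attn` is scaled together with `d_ff`, and the sizes used are hypothetical values picked for the arithmetic, not settings from the paper.

```python
def block_params(d_model, d_ff, d_attn):
    """Approximate weight count of one transformer block."""
    attention = 4 * d_model * d_attn   # query, key, value, and output projections
    feed_forward = 2 * d_model * d_ff  # two dense layers
    return attention + feed_forward


def total_params(num_layers, d_model, d_ff, d_attn):
    return num_layers * block_params(d_model, d_ff, d_attn)


# Hypothetical configurations: half the depth, twice the width.
shallow_wide = total_params(num_layers=6, d_model=512, d_ff=4096, d_attn=1024)
deep_narrow = total_params(num_layers=12, d_model=512, d_ff=2048, d_attn=512)
print(shallow_wide, deep_narrow)
```

Both configurations have the same number of block weights, so any difference in quality between them comes from how the budget is arranged rather than from its size.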

Parameter sharing

The transformer has three weight matrices of shape d_model × d_vocab: the encoder input embedding, the decoder input embedding, and the decoder output (softmax) layer weights. We try tying (sharing weights between) and untying (using separate weights for) these matrices. Specifically, we tested tying the encoder input and decoder input embeddings, tying the decoder input and output embeddings, and untying all embedding matrices. We also tested “adaptive input embeddings”, where vocabulary items are clustered based on their frequencies. Clusters of more frequent items get a larger embedding dimension, and the embedding vectors are projected to the same dimension and concatenated. In some cases, we factorized the embedding matrix of size d_model × d_vocab into two matrices of size d_model × d_inner and d_inner × d_vocab. In addition to sharing embeddings, we also experiment with sharing other sets of parameters (such as the self-attention block) across all L layers.
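
The saving from factorizing an embedding matrix is easy to check with arithmetic. The sketch below counts weights only, and the vocabulary and dimension sizes are hypothetical values chosen for illustration.

```python
def embedding_params(d_vocab, d_model, d_inner=None):
    """Weight count of an embedding matrix, optionally factorized.

    Full:       d_vocab x d_model
    Factorized: d_vocab x d_inner  plus  d_inner x d_model
    """
    if d_inner is None:
        return d_vocab * d_model
    return d_vocab * d_inner + d_inner * d_model


# Hypothetical sizes, for illustration only.
full = embedding_params(d_vocab=32000, d_model=768)
factorized = embedding_params(d_vocab=32000, d_model=768, d_inner=128)
print(full, factorized)
```

The factorized version is smaller whenever d_inner is well below d_model, because the large vocabulary dimension is only multiplied by the small inner dimension.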


Softmax

We tested a variation of softmax called Adaptive Softmax, which forms clusters based on word frequency. Each cluster can have a different size, and the size of the rare-word clusters is reduced using a projection matrix. Another method, the mixture of softmaxes (MoS), computes a linear combination of several softmaxes weighted by learned coefficients, instead of a single softmax.
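
The mixture of softmaxes can be sketched in a few lines. This example assumes the per-component logits and the mixture logits have already been produced by learned projections, which are omitted here; the function name `mixture_of_softmaxes` is illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_softmaxes(component_logits, mixture_logits):
    """Weighted sum of several softmax distributions over the vocabulary.

    component_logits: one list of vocabulary logits per mixture component.
    mixture_logits:   one logit per component; softmaxed into mixing weights.
    """
    mixing = softmax(mixture_logits)
    components = [softmax(logits) for logits in component_logits]
    vocab_size = len(components[0])
    return [
        sum(pi * comp[i] for pi, comp in zip(mixing, components))
        for i in range(vocab_size)
    ]
```

Since the mixing weights sum to one and each component is a valid distribution, the result is also a valid distribution. The extra softmaxes are what make this variant slower than a single softmax.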


Architectures

We test a number of architectures that have been proposed by the scientific community over the years:

  1. Transparent Attention uses weighted residual connections.
  2. The Evolved Transformer was designed using evolution-based architecture search.
  3. We experiment with factorized, dense, and random Synthesizer variants, where the self-attention operation is replaced with “synthetic attention” patterns. “plus” denotes the case where dot-product attention is added to synthetic attention, and “plus alpha” denotes the case where a scalar weight α is used to interpolate between synthetic and dot-product attention.
  4. The Funnel Transformer reduces the sequence length in the encoder to encode the input sequence efficiently.
  5. Lightweight Convolutions share the kernel weights across every m consecutive channels (out of d_model channels in total) and normalize the shared weights. In Dynamic Convolutions, the kernel weights are not fixed after training but are predicted at each time step by a simple linear function of the layer input.
  6. The Switch Transformer and the Mixture of Experts (MoE) transformer make use of adaptive computation, i.e. they replace the feed-forward layers with sparsely activated expert layers that learn to select the parameters used for each token.
  7. Product Key Memory networks are similar to the expert layers, except that they use a k-nearest-neighbor weighted sum to select the parameters instead of a learned approach.
  8. Universal Transformers apply the same transformer block to the input sequence repeatedly until certain conditions are met.
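
To illustrate the sparsely activated expert layers from item 6, here is a minimal sketch of top-1 routing in the style of the Switch Transformer. It assumes a router that is a single linear map followed by a softmax and represents experts as plain Python callables; the names `switch_layer`, `router_weights`, and `experts` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def switch_layer(x, router_weights, experts):
    """Route one token vector x to a single expert (top-1 routing).

    router_weights: one row of weights per expert.
    experts:        list of callables, each mapping a vector to a vector.
    Returns the gated expert output and the index of the chosen expert.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    expert_output = experts[best](x)
    # Scale by the router probability so the router receives a gradient.
    return [probs[best] * y for y in expert_output], best
```

Only the chosen expert runs for a given token, which is why the parameter count can grow with the number of experts while the per-token computation stays roughly the same.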

Experiment and Evaluation

We test all the variations listed in the previous section against our baseline model: the original transformer model (called the 'vanilla transformer') with two modifications. Layer normalization is applied before the self-attention and feed-forward sub-blocks, and relative positional encodings are used instead of sinusoidal positional encodings. The hyperparameters, parameter count/FLOPs, training set, and optimizer were kept constant across all models. The models are pre-trained on the C4 dataset and tested on three transfer-learning tasks: SuperGLUE for natural language understanding, the XSum abstractive summarization dataset, and the WebQuestions question answering task. In addition, they were trained and evaluated in a supervised setting on the WMT’14 English-to-German translation task. For more implementation details, please refer to the original paper.
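
The first baseline modification, normalizing before each sub-block rather than after it, can be sketched as follows. This is a simplified illustration: `layer_norm` omits the learned γ and β, and `sublayer` stands in for either the self-attention or the feed-forward sub-block.

```python
import math


def layer_norm(h, eps=1e-6):
    """Layer normalization without the learned gain and bias."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def post_ln_block(x, sublayer):
    """Original transformer: normalize after the residual addition."""
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Baseline used here: normalize only the input to the sub-block."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]
```

In the pre-normalization form the residual path carries x through unchanged, so a sub-block that outputs zeros leaves the input exactly as it was.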

The GLU variants (LiGLU, SwiGLU, and so on) were found to be more effective than the ReLU activation. In addition, RMSNorm improves both model speed and performance. The mixture-of-experts and Switch transformers perform well, but they have far more parameters than the vanilla transformer, although the number of operations (Ops) is comparable because not all parameters are used at each step.

Note that the modifications that significantly improved performance fall into one of three categories: minor changes (GLU variants, RMSNorm, untying embedding matrices); changes that increase the parameter count (Switch Transformer, product key memory) or are slower (mixture of softmaxes, deeper models); and changes that were originally invented in the Mesh TensorFlow codebase used in these experiments (mixture of experts, Switch Transformer, Synthesizer).

Very few techniques improved on the performance of the vanilla transformer, which is contrary to what the experiments in the original papers for these techniques suggest. To ensure reliability, we asked 12 of the authors of those papers to validate our implementations, all of whom responded in the affirmative.


This paper shows how important it is to try out a modification across a variety of tasks and codebases before proposing it as an architectural improvement. Transformer modifications should be tested on supervised learning, transfer learning, language modeling, or even vision tasks. It is important to keep the hyperparameters constant when comparing modifications, and cherry-picking well-tuned models should be avoided. These practices should allow future work to develop flexible techniques that can be widely adopted across a spectrum of tasks without wasting valuable time and resources.

Thapa Samrat
I am a second-year international student from Nepal, currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, so I write articles about them in my spare time.
