Decomposing the Transformer to Capture Independent Mechanisms?
3 main points
✔️ A Transformer that incorporates the Independent Mechanisms Hypothesis
✔️ Decomposes the Transformer into multiple modules by leveraging the attention mechanism
✔️ Effectiveness confirmed on a wide range of tasks that use the Transformer
Transformers with Competitive Ensembles of Independent Mechanisms
written by Alex Lamb, Di He, Anirudh Goyal, Guolin Ke, Chien-Feng Liao, Mirco Ravanelli, Yoshua Bengio
(Submitted on 27 Feb 2021)
Comments: Accepted by ICML 2021.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Introduction
The Transformer architecture, found in the popular zero-shot language generation model GPT-3 and the text-to-image generation model DALL-E, processes the information at every position in a single large shared representation. This means that irrelevant information is processed together, which limits the model's ability to capture the independent structures that exist in the world. This article introduces Transformers with Independent Mechanisms (TIM), an improved Transformer designed to address this problem.
A key concept in TIM is the Independent Mechanisms Hypothesis, which states that physical phenomena can be viewed as the behavior of independent modules governed by separate underlying mechanisms. This hypothesis is a basic premise in the causal-inference community, but only a few deep-learning studies have dealt with it so far, such as Independent Causal Mechanisms and Recurrent Independent Mechanisms (RIM). In this article, I will introduce how TIM is designed and the experimental settings in which it achieves its results.
Architecture of TIM
The TIM proposed by the authors can be seen as an architecture in which the Transformer is decomposed into several independent parts, so that several Mechanism modules sit at each position.
A very simple example with three Mechanisms per position is shown in Figure 2. Information is first shared by attention calculations along two axes, the position axis (corresponding to the time axis of a sequence model) and the Mechanism-module axis, and the latent variables are then updated by FFN feed-forward networks along those same axes. In addition, TIM can simply replace a standard Transformer layer, so it is easy to apply to other Transformer-based methods. We now explain the details of TIM following the four steps of the algorithm.
Competition among Mechanisms
First, to increase the specialization of each Mechanism module (the property that one module performs one type of processing), the modules keep their own parameters and share information only through attention. On top of that, to introduce an even stronger inductive bias, a relevance score computed via attention, similar to the earlier study RIM, is used to induce competition among the modules.
Specifically, as the equation in step 1 of the algorithm shows, the representation h of each module is linearly transformed into a single value (GroupLinear), and the scores are then obtained with a softmax. These scores weight how much information each Mechanism module can access and update, and they are applied later when the latent variables are updated by attention along the position axis.
For a Mechanism module to secure the information it wants, it must suppress the relevance scores of the other modules, which is expected to increase each module's specialization.
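Step 1 can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name, argument names, and tensor shapes (`T` positions, `n_mech` mechanisms, dimension `d`) are my own assumptions, with a per-mechanism linear map producing one logit and a softmax across mechanisms producing the competition weights.

```python
import numpy as np

def competition_scores(h, W, b):
    # h: (T, n_mech, d) latent states; W: (n_mech, d) and b: (n_mech,) are
    # per-mechanism "GroupLinear" parameters mapping each state to one logit.
    logits = np.einsum('tmd,md->tm', h, W) + b     # one score per mechanism
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)       # softmax across mechanisms
```

Because the softmax normalizes across mechanisms at each position, a module can only raise its own weight by lowering the others', which is the competitive pressure described above.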
Position axis information sharing
In step 2, each Mechanism module performs an attention calculation along the position axis, followed by a linear transformation (GroupLinear). GroupLinear, which also appeared in step 1, is a layer in which the linear transformation is applied only within each module (group), in contrast to an ordinary dense Linear layer. Finally, the latent variable h is updated, weighted by the score calculated in step 1. Note that an architecture with only position-axis information sharing would amount to simply combining multiple independent Transformers.
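The GroupLinear layer can be sketched as a block-diagonal linear map. This is an illustrative NumPy version under my own shape conventions, not the paper's implementation: each mechanism is transformed only by its own weight matrix, so no parameters connect different mechanisms.

```python
import numpy as np

def group_linear(x, W):
    # x: (T, n_mech, d_in); W: (n_mech, d_in, d_out).  Mechanism m is mapped
    # only by its own matrix W[m], unlike a dense Linear over the flattened
    # (n_mech * d_in) vector, which would mix all mechanisms together.
    return np.einsum('tmd,mde->tme', x, W)
```

A quick way to see the independence: perturbing mechanism 1's input changes only mechanism 1's output, leaving the other mechanisms' outputs untouched.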
Information sharing of Mechanism module axis
So far, each Mechanism module processes information independently, but a minimal amount of information sharing between modules is also believed to be necessary. In step 3, only a small amount of information is shared by performing attention calculations along the Mechanism-module axis, using only a small multi-head attention with 2 heads and 32 units.
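As a rough sketch of step 3, the mechanisms at each position can attend to one another with a deliberately low-capacity attention. The code below is my own single-head NumPy simplification (the paper uses a 2-head, 32-unit attention); all names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mechanism_axis_attention(h, Wq, Wk, Wv):
    # h: (T, n_mech, d).  At every position independently, the n_mech
    # mechanisms attend to each other through small shared projections
    # Wq/Wk/Wv: (d, d_k), letting a little information leak between modules.
    q, k, v = h @ Wq, h @ Wk, h @ Wv                          # (T, n_mech, d_k)
    scores = np.einsum('tmd,tnd->tmn', q, k) / np.sqrt(q.shape[-1])
    return np.einsum('tmn,tnd->tmd', softmax(scores), v)      # (T, n_mech, d_k)
```

Keeping `d_k` small relative to `d` is what limits this step to "a small amount" of sharing while preserving the modules' independence.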
Update latent variables in forwarding propagation
In step 4, we update the latent variable h by applying FFN feed-forward networks with linear transformations along the two axes, the Mechanism axis and the position axis, respectively.
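The per-mechanism FFN update can be sketched as below. Again this is an illustrative NumPy simplification under my own shape conventions (layer normalization is omitted for brevity): a standard two-layer position-wise FFN, but with separate weights for each mechanism and a residual connection.

```python
import numpy as np

def mechanism_ffn(h, W1, b1, W2, b2):
    # h: (T, n_mech, d); W1: (n_mech, d, d_ff); b1: (n_mech, d_ff);
    # W2: (n_mech, d_ff, d); b2: (n_mech, d).  The FFN is applied
    # position-wise with per-mechanism weights; a residual connection
    # keeps the update incremental, as in a standard Transformer layer.
    hidden = np.maximum(np.einsum('tmd,mdf->tmf', h, W1) + b1, 0.0)  # ReLU
    return h + np.einsum('tmf,mfd->tmd', hidden, W2) + b2
```

As with GroupLinear in step 2, the block-diagonal weights ensure that this final update mixes no information across mechanisms.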
Experiments
The authors answer two questions to assess the effectiveness of TIM on datasets they believe contain independent mechanisms.
1. Can TIM learn Mechanism modules with reasonable and meaningful specialization? This is tested on toy data and on real, large-scale speech recognition and language processing tasks.
2. Can models with independent mechanisms improve quantitative accuracy? This is tested on tasks such as speech enhancement and BERT masked language modeling (MLM).
Since the Transformer is used in a wide range of fields and experimenting requires only replacing the Transformer with TIM, readers may want to consider whether an independent mechanism exists in their own research topic and whether TIM can be applied there.
Here, an image-generation model that incorporates TIM into the GPT-2 architecture is used to solve a custom task containing two distinctly different mechanisms.
Specifically, using a synthetic dataset that places an MNIST digit on the left and a randomly chosen CIFAR image on the right, the authors visualize which Mechanism module is most activated and evaluate whether a module specializes in one of the two halves.
The right side of Figure 3 shows that TIM specialized different Mechanism modules to the two sides of this synthetic dataset. Interestingly, a module that specialized in color brightness early in training comes to specialize in the left and right datasets as training progresses.
Specialization for objects versus backgrounds was also observed on the CIFAR-10 dataset (Figure 3, left).
Speech enhancement is the task of improving the quality of real-world, noisy speech data. Traditional approaches based on signal-processing techniques achieve this by detecting and removing non-speech sounds. In recent years, Transformer-based methods have surpassed these traditional approaches and demonstrated their effectiveness.
If we think of the explicit separation of linguistically meaningful and non-linguistic sounds as processing data produced by different mechanisms, TIM is well suited to this task.
Table 3 shows the experimental results on the DNS dataset, which consists of high-quality speech with added noise, using PESQ, a measure of speech quality. The proposed method TIM achieves state-of-the-art results with only 1/8 of the parameters of the previous best method, PoCoNet.
From the visualization in Figure 5, we can also see that the independence of the Mechanism modules becomes more apparent in deeper layers, where the modules grow more specialized.