How Specialists Collaborate: Shared Global Workspace

Transformer 16/08/2021

3 main points
✔️ Introduces ideas from Global Workspace Theory in cognitive science into Transformers and RIMs
✔️ Proposes a Shared Workspace with a competition mechanism and a broadcast mechanism
✔️ Demonstrates through challenging experiments that the Shared Workspace is effective for sharing information among all specialists

Coordination Among Neural Modules Through a Shared Global Workspace
written by Anirudh Goyal, Aniket Didolkar, Alex Lamb, Kartikeya Badola, Nan Rosemary Ke, Nasim Rahaman, Jonathan Binas, Charles Blundell, Michael Mozer, Yoshua Bengio
(Submitted on 1 Mar 2021)
Comments: Published on arXiv.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)

The images used in this article are either from the paper or created based on it.

Introduction

Structured models that clearly separate different kinds of information are likely to become a trend in deep learning. In fact, AI research in the 1980s already focused on how to design architectures that could give rise to intelligence. One of the most powerful ideas of that era was that very complex systems could be built if modules with different roles could be coordinated well.

Synchronizing the modules as a whole has remained an open problem, because it is needed whenever the causal relationships between modules and their effects on one another matter. Inspired by the Global Workspace Theory of cognitive science, the authors propose adding to the modular structure a shared representation that all modules can access simultaneously. Any specialist module can influence this shared representation, and its contents can be broadcast to all modules. Because such a shared representation for coordinating information between modules fits naturally with Transformers and RIMs, the authors build on and extend these architectures.

The Attention mechanism used in Transformers and RIMs lets positions interact pairwise. In other words, an attention score is computed for every pair of positions. The paper points out that such pairwise interaction does not by itself share information consistently among all positions, and argues that a mechanism is needed that lets all parts (modules) of the model share information.

In a nutshell, the proposed method, Shared Workspace, allows a specialist module to write to the Shared Workspace only when it is among the most relevant to the input, and automatically broadcasts the information in the Shared Workspace to all specialist modules. For clarity, Specialist Modules are referred to simply as specialists in the rest of this article. Also, to avoid misunderstandings, technical terms related to the proposed method (e.g., Shared Workspace) are used without translation.

Synchronization between module structures with Shared Workspace

To replicate the global workspace architecture of cognitive science, the authors designed an architecture in which specialists communicate sparsely through a shared working memory. Specifically, they added a Shared Workspace to Transformers and RIMs (attention-based and slot-based modular architectures) and extended them with a mechanism in which modules compete for permission to write to it. This Shared Workspace structure is expected to allow better synchronization and coordination between specialists.

While both Transformers and RIMs use a pairwise Self-Attention mechanism to share information among specialists, the proposed method shares information among specialists through a limited-capacity Shared Workspace. Each computation stage consists of two parts: first, the specialists compete for the right to write to the Shared Workspace; second, the contents of the Shared Workspace are broadcast to all specialists simultaneously.

A concrete example is shown in Figure 2: a Shared Workspace layer is added to Transformer (b) and Universal Transformer (d), and the module-to-module communication layer in RIMs (a) and TIMs (c) is replaced with a Shared Workspace layer. Both writing to the Shared Workspace and broadcasting its contents are implemented with the Attention mechanism. * For more information about RIMs and TIMs, please refer to this article. The details of Shared Workspace are explained below in three steps.

Obtain a specialist's representation from the input information

In Step 1, a representation for each specialist is obtained from the input information. This step prepares the specialists' representations for the different inputs in RIMs and Transformers.

With these representations, the specialists are ready for the next two steps of each computation stage: (1) each specialist competes to write to the Shared Workspace; (2) each specialist receives the information broadcast from the Shared Workspace.

Writing Information to a Shared Workspace

In Step 2, the specialists compete for the right to write to the Shared Workspace whenever they need to update in response to newly received information. In other words, specialists that are relevant to the input learn to obtain a higher relevance score, which is computed by Key-Value Attention. A specialist that reacts to any input indiscriminately and updates frequently risks losing out to other specialists when a truly important input arrives. Such a competitive system can therefore be expected to lead to a division of labor in which different specialists take charge of different inputs.

Specifically, the Key and Value of Key-Value Attention are linear transformations of the specialists' representations obtained in Step 1, and the Query is a transformation of the Shared Workspace representation. In other words, the Shared Workspace queries the specialists to find the relevant ones. The k specialists with the highest attention scores are then selected, and their information is written to the Shared Workspace, weighted by the Softmax over their scores (Equation 1). This top-k operation can be seen as a compromise between standard soft-attention (all specialists write) and hard-attention (only the top-1 specialist writes).

(1)
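The writing step described above can be sketched in NumPy as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function name `write_to_workspace`, the projection matrices `Wq`, `Wk`, `Wv`, the per-slot top-k selection, and the plain overwrite of the workspace slots are all choices made here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def write_to_workspace(memory, specialists, Wq, Wk, Wv, k):
    """Hypothetical write step.

    memory:      (m, d) Shared Workspace slots (source of the Query)
    specialists: (n, d) specialist representations (source of Key and Value)
    k:           number of specialists allowed to write to each slot
    """
    q = memory @ Wq                # (m, d_a) queries from the workspace
    keys = specialists @ Wk        # (n, d_a) keys from the specialists
    vals = specialists @ Wv        # (n, d)   values from the specialists
    scores = q @ keys.T / np.sqrt(keys.shape[-1])   # (m, n) relevance scores
    # Keep only the k highest-scoring specialists per slot (assumed top-k rule).
    # Exact ties at the threshold would keep more than k entries.
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = softmax(masked, axis=-1)              # zero outside the top-k
    # Simplification: the slots are overwritten by the attention result.
    return weights @ vals, weights

rng = np.random.default_rng(0)
n, m, d = 6, 2, 8                  # 6 specialists, 2 workspace slots
specialists = rng.normal(size=(n, d))
memory = rng.normal(size=(m, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
new_memory, weights = write_to_workspace(memory, specialists, Wq, Wk, Wv, k=2)
```

Setting `k = n` recovers ordinary soft-attention over all specialists, and `k = 1` corresponds to hard-attention on the single winning specialist.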

Broadcast information from Shared Workspace

In Step 3, the Shared Workspace information is broadcast to all specialists. Again, the relevance score calculated by the Attention mechanism determines how strongly each specialist is updated. In contrast to Step 2, however, the Query is generated from the specialists and the Key and Value are generated from the Shared Workspace, and Soft-Attention is computed over them. All specialists update their latent representations based on the Shared Workspace information they obtain (Equation 2). The update itself is performed by an LSTM or GRU in RIMs and by the feed-forward (FFN) layer in Transformers.

(2)

Here, h is the latent representation (the hidden state of the LSTM or GRU), S is the relevance score, and v is a linear transformation of the Shared Workspace representation.
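The broadcast step can be sketched in the same way. The snippet below is a hedged illustration only: it assumes an additive update in which each specialist adds the attention-weighted values `S @ v` to its latent state `h`, and it leaves out the LSTM, GRU, or FFN update that follows. The function name and the projection matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def broadcast_from_workspace(h, memory, Wq, Wk, Wv):
    """Hypothetical broadcast step.

    h:      (n, d) latent representations of the specialists (source of the Query)
    memory: (m, d) Shared Workspace slots (source of Key and Value)
    """
    q = h @ Wq                     # (n, d_a) queries from the specialists
    keys = memory @ Wk             # (m, d_a) keys from the workspace
    v = memory @ Wv                # (m, d)   values from the workspace
    S = softmax(q @ keys.T / np.sqrt(keys.shape[-1]), axis=-1)  # (n, m) scores
    # Assumed additive update: every specialist reads from all slots.
    return h + S @ v, S

rng = np.random.default_rng(0)
n, m, d = 6, 2, 8
h = rng.normal(size=(n, d))
memory = rng.normal(size=(m, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
h_new, S = broadcast_from_workspace(h, memory, Wq, Wk, Wv)

# With an empty (all-zero) workspace nothing is broadcast and h is unchanged.
h_same, _ = broadcast_from_workspace(h, np.zeros((m, d)), Wq, Wk, Wv)
```

Unlike the writing step, no top-k selection is applied here: every specialist receives a soft mixture of all workspace slots.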

Shared Workspace Consistency and Computational Complexity

As for consistency, the Shared Workspace is updated at every step and reset only at the end of an episode. In other words, in RIMs the same Shared Workspace is carried over and shared until the input sequence ends (for example, until the game ends), and in Transformers it is shared until propagation reaches the final layer.

With n specialists, the proposed Shared Workspace structure requires O(n) computation, whereas Transformers and RIMs require O(n^2) because they compute the relevance between every pair of specialists with the Attention mechanism. In practice the number of workspace slots is small and fixed, so the structure using Shared Workspace is clearly superior in terms of computational complexity and is suitable for large-scale experiments. (For reference, according to the related literature, the capacity of human working memory is fewer than 10 items, which is considered very small.)
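The difference in scaling can be made concrete by counting attention scores. The sketch below is illustrative arithmetic rather than a measurement: it assumes a workspace with a fixed number of slots m (a symbol introduced here) and counts one m x n score matrix for writing plus one n x m score matrix for broadcasting.

```python
def pairwise_scores(n):
    # Self-attention compares every specialist with every other one.
    return n * n

def workspace_scores(n, m):
    # Write step: m slots attend over n specialists (m * n scores).
    # Broadcast step: n specialists attend over m slots (n * m scores).
    return 2 * n * m

m = 8  # assumed fixed workspace size
for n in (8, 64, 512):
    print(n, pairwise_scores(n), workspace_scores(n, m))
```

For n = 64 and m = 8 this gives 4096 pairwise scores against 1024 through the workspace, and the gap widens as n grows because the second count is linear in n.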

Experiments

The experiments show two things. (a) Shared Workspace improves accuracy across a wide range of benchmarks, demonstrating the practicality and versatility of the proposed method. (b) The accuracy improvement can be achieved without pairwise interactions, showing that Shared Workspace can maintain consistency across specialists. The detailed experimental setup is given in the appendix of the paper.

Tasks that test understanding of image input

Specialists are expected to write only information that is useful for the downstream task to the limited-capacity Shared Workspace. This idea is tested on tasks that process several kinds of visual input, using the following baselines and variants.

TR (Transformers): Transformers with shared parameters for each layer.
STR (Sparse Transformers): Transformers with a sparse Attention Matrix
TR+HC (High Capacity Transformers): Transformers with different parameters for each layer
TR+SSW (Transformers with Shared Workspace with soft-competition): Transformers incorporating Shared Workspace with soft-attention
TR+HSW (Transformers with Shared Workspace with top-k competition): Transformers that incorporate Shared Workspace with top-k attention

Detecting Equilateral Triangles

The task is binary classification: decide whether the point clusters appearing in a 64*64 image form an equilateral triangle. The baseline TR is a Vision Transformer (ViT). Each image is divided into 4*4 patches, and the resulting sequence of patches is fed to the model. Because the task can be solved by attending to a small amount of specific information, the hypothesis is that a limited-capacity Shared Workspace can attend only to the important information and thereby improve accuracy. The results (Figure 3) support this hypothesis: TR+HSW shows smaller variance and better accuracy than the baseline TR.
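The patch-based input can be illustrated with a short NumPy sketch. It assumes that "4*4" refers to the patch size in pixels, so that a 64*64 image yields 256 flattened patches; if it instead refers to a 4*4 grid, the same function applies with a patch size of 16. The helper name `to_patches` is hypothetical.

```python
import numpy as np

def to_patches(image, patch):
    """Split a 2-D image into non-overlapping, flattened square patches."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    return (image.reshape(gh, patch, gw, patch)   # (row block, y, col block, x)
                 .transpose(0, 2, 1, 3)           # (row block, col block, y, x)
                 .reshape(gh * gw, patch * patch))

image = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
patches = to_patches(image, 4)   # one row per patch, in row-major patch order
```

Each row of `patches` would then be embedded and treated as one position in the input sequence, as in ViT.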

CATER: Object Tracking Task

In CATER, the model is given a video and must predict which cell of a 6*6 grid contains the target object at the end of the video. If the target object is visible in the last frame the task is easy, but if it is hidden by an obstacle, the model needs the reasoning ability to track an invisible object accurately over a long time. Table 1 shows that the proposed methods TR+HSW and TR+SSW are slightly better than the baseline.

Although this experiment does not show a clear improvement in accuracy, the authors note that the proposed method matches or exceeds the baseline on this difficult 36-class classification task (6*6 cells).

Sort-of-CLEVR: A Relational Reasoning Task

This is an inference task in which the model is given an image containing several objects together with a question about them. Each 75*75 image contains objects in six different colors and two different shapes, placed at random, and answering correctly requires focusing on particular objects. Each image comes with 10 non-relational questions, such as "What is the shape of the red object?", as well as relational questions, such as "What is the shape of the object closest to the red object?". The input image is divided into patches as in ViT, and the task is treated as classification because the set of possible answers in Sort-of-CLEVR is finite.

It can be seen from the results (Figure 4) that the proposed method with Shared Workspace converges faster and has better accuracy for both relational and non-relational problems. Therefore, we believe that Shared Workspace is superior to the conventional Transformers architecture for such tasks with discrete information.

Physical Reasoning task: Reasoning about physical processes

Predicting the motion of bouncing balls requires capturing the motion of each ball separately. In this task, the model is given the first 10 frames and is evaluated by the prediction loss at frames 30 and 45. LSTM, RIMs, and RMC serve as baselines for checking how much RIMs + Shared Workspace improves accuracy. Across several different conditions, RIMs with Shared Workspace improves accuracy in every case and outperforms RMC in most of them.

Shared Workspace for Multiagent Starcraft World Modelling

Finally, the authors experimented with Starcraft, a multi-agent game environment (SC2 domain). This is a very challenging environment because each agent has complex skills and characteristics as well as state indicators such as attack, defense, and HP values. However, a game built on such discrete attributes and their interactions is well suited to the modular RIMs+Shared Workspace architecture, so the effectiveness of the proposed method can be confirmed. Please refer to Appendix G of the paper for details of the experiments.

The accuracy of RIMs in Table 2 is poor because RIMs rely on pairwise interaction, in which communication takes place between two specialists at a time. This is ill-suited to a task where three or more kinds of information must be considered simultaneously. The proposed method using Shared Workspace is more accurate than both LSTM and RIMs, which shows that maintaining information consistency among different specialists is effective.

Summary

Inspired by the Global Workspace Theory in cognitive science, this paper verified that extending RIMs and Transformers with a Shared Workspace maintains information consistency across all specialist modules. Through several experiments, the authors showed that coordinating all modules through a Shared Workspace is more effective than baselines that rely on pairwise interactions.

From a personal point of view, although the proposed method has not produced outstanding experimental results, it attempts to bring important concepts such as the independent mechanisms of causal inference and the Global Workspace of cognitive science into deep learning. I believe the motivation behind exploring such new architectures is that existing deep learning architectures are not sufficient to achieve so-called strong AI. When current AI based on large-scale models and big data reaches its limits, research on such architectures is bound to become important, so I think it is worth digging deeper.