Transformer Is All You Need: They Can Do Anything!

Transformer 29/03/2021

3 main points
✔️ One transformer model for 7 different tasks across 8 different datasets in vision, NLP, and vision + NLP.
✔️ Performance competitive with current SOTA models.
✔️ Parameter efficient compared to task-specific models.

Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer
written by Ronghang Hu, Amanpreet Singh
(Submitted on 22 Feb 2021)
Comments: Accepted to arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Introduction

Our brain is highly flexible in the things it can do. It enables you to speak, write, listen, see, think, move, and much more. This versatility is also a desirable quality in artificial neural networks. Recently, transformers have been used for a wide variety of tasks in vision (object detection, instance segmentation), NLP (sentiment analysis, language modeling), vision + NLP (visual entailment, visual question answering), and more. Despite this, consolidating all of these tasks into a single transformer model had not been achieved successfully until now.

In this paper, we describe a multimodal encoder-decoder transformer model named UniT, which can simultaneously handle 7 different tasks across different domains, from object detection to natural language understanding, visual question answering, and more.

Unified Transformer (UniT): One For All

UniT is composed of two different encoders: one for encoding images with a CNN backbone, and one for encoding text. It is based on the DETR model and has a joint decoder that takes task-specific query embeddings, followed by task-specific output heads. Each of the components is explained below:

1) Image Encoder

The image encoder is necessary for vision-only or vision-and-text tasks. It consists of a CNN backbone to extract local features, followed by a multi-headed self-attention encoder to capture global contextual features. The CNN backbone is a ResNet-50 with dilation in the last C5 block, pre-trained for object detection. The feature map produced by this CNN backbone for the image I is flattened and passed on to the visual transformer encoder to generate a set of visual hidden states h^v = {h^v₁, h^v₂...h^v_L}. Here L is equal to H_v × W_v, where H_v and W_v are the height and width of the feature map produced by the CNN backbone.

Some tasks such as object detection and VQA require task-specific information to be extracted from the images. For this, we also use a task-embedding vector as follows:

h^v = {h^v₁, h^v₂...h^v_L} = E_v(P_b→e(x^v), w_v^task)

Here x^v is the feature map produced by the CNN backbone. P_b→e is a linear transformation that maps the feature map vectors into the hidden dimension of the transformer encoder. w_v^task is the task embedding vector, which is concatenated to the transformed feature map vectors. E_v is the visual transformer encoder.
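
The construction of the visual encoder input can be sketched with plain Python lists. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and toy dimensions are hypothetical, placing the task embedding at the end of the sequence is an assumption (the text only says it is concatenated), and the transformer encoder E_v itself is left out.

```python
def flatten_feature_map(feature_map):
    """Turn an H_v x W_v grid of d_b-dimensional vectors into L = H_v * W_v vectors."""
    return [vector for row in feature_map for vector in row]


def project(vector, weight, bias):
    """Linear map standing in for P_b->e: d_b inputs -> d_e outputs."""
    return [
        sum(w * x for w, x in zip(weight_row, vector)) + b
        for weight_row, b in zip(weight, bias)
    ]


def build_visual_encoder_input(feature_map, weight, bias, task_embedding):
    """Project every feature-map vector, then concatenate the task embedding."""
    tokens = [project(v, weight, bias) for v in flatten_feature_map(feature_map)]
    # Assumption: the task embedding goes last in the sequence.
    return tokens + [task_embedding]


# Toy sizes (hypothetical): H_v = 2, W_v = 3, d_b = 4, d_e = 2.
H_v, W_v, d_b, d_e = 2, 3, 4, 2
feature_map = [
    [[float(r + c + k) for k in range(d_b)] for c in range(W_v)]
    for r in range(H_v)
]
weight = [[0.1] * d_b for _ in range(d_e)]
bias = [0.0] * d_e
w_task = [1.0] * d_e

encoder_input = build_visual_encoder_input(feature_map, weight, bias, w_task)
# Sequence length is L + 1 and every entry has the hidden size d_e.
print(len(encoder_input), len(encoder_input[0]))
```

The point of the sketch is the shape bookkeeping: the encoder sees one d_e-dimensional vector per feature-map location plus one extra vector carrying the task.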

2) Text Encoder

The text encoder is necessary for language tasks such as the GLUE tasks (QNLI, MNLI, QQP, SST-2) and for language-and-vision tasks like VQA. It is made up of BERT, a language model pre-trained using masked language modeling and next sentence prediction. Given a sequence of words, the words are tokenized into S tokens {w₁, w₂...w_S}, where the first token w₁ is the special token [CLS] (used for classification in BERT). Similar to the image encoder, we also append a task-specific embedding vector w_t^task to the sequence. Once BERT encodes the sequence, the encodings corresponding to w_t^task are removed, leaving the textual hidden states:

h^t = {h^t₁, h^t₂...h^t_S} = BERT({w₁, w₂...w_S}, w_t^task)

It was also found that using only the encoding corresponding to the [CLS] token works very well, and it saves some computation.
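
The two options for the text encoder output (all token encodings, or only the [CLS] encoding) can be sketched as a small selection step. This is an illustrative sketch only: the function name, its parameters, and the assumption that the task embedding occupies the last position of the sequence are hypothetical, and BERT itself is replaced by a toy list of hidden states.

```python
def select_text_states(bert_output, num_task_positions=1, cls_only=False):
    """Drop the task-embedding position(s) and optionally keep only [CLS].

    bert_output: hidden states for [w_1 .. w_S] followed by the task position(s).
    """
    # Assumption: task embeddings sit at the end of the encoded sequence.
    text_states = bert_output[: len(bert_output) - num_task_positions]
    if cls_only:
        # w_1 is [CLS], so its encoding is the first hidden state.
        return text_states[:1]
    return text_states


# Toy encoder output (hypothetical): S = 5 token states plus 1 task state.
bert_output = [[float(i)] * 3 for i in range(6)]

all_tokens = select_text_states(bert_output)
cls_state = select_text_states(bert_output, cls_only=True)
print(len(all_tokens), len(cls_state))
```

Keeping only the [CLS] state shortens the sequence the decoder attends to from S vectors to one, which is where the computation saving comes from.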

3) Domain-agnostic Decoder

Unlike the encoders, which are specific to each modality, the decoder in our main model is shared across all tasks. For vision tasks, only the encodings from the image encoder are used, and for language tasks, only the encodings from the text encoder are used. For language-and-vision tasks, the two sets of encodings are concatenated.
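
The rule for choosing the decoder input can be written as a short helper. This is a sketch under assumptions: the function name is hypothetical, and putting the visual states before the text states in the concatenation is an arbitrary choice made for illustration.

```python
def build_decoder_memory(visual_states=None, text_states=None):
    """Return the encoder states h^enc that the shared decoder attends to."""
    if visual_states is None and text_states is None:
        raise ValueError("at least one modality is required")
    memory = []
    if visual_states is not None:
        memory.extend(visual_states)   # vision-only or vision-and-language task
    if text_states is not None:
        memory.extend(text_states)     # language-only or vision-and-language task
    return memory


# Toy hidden states (hypothetical sizes): 4 visual and 3 text vectors.
h_v = [[1.0, 0.0] for _ in range(4)]
h_t = [[0.0, 1.0] for _ in range(3)]

print(len(build_decoder_memory(visual_states=h_v)))                   # vision task
print(len(build_decoder_memory(text_states=h_t)))                     # language task
print(len(build_decoder_memory(visual_states=h_v, text_states=h_t)))  # joint task
```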

The transformer decoder takes in the encodings h^enc from the encoders and a task-specific query embedding sequence q^task of length q. Each of the l transformer decoder layers outputs a sequence of decoded hidden states h^dec,l, also of length q. Within each layer, self-attention is applied among the decoder hidden states h^dec,l, and cross-attention is applied to the encoder states h^enc.
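
The attention pattern of the decoder can be illustrated with a stripped-down sketch. It is deliberately simplified and is not the UniT or DETR implementation: it uses a single attention head, omits the learned query/key/value projections, layer normalization, and feed-forward sublayers, and all names are hypothetical. What it does show is that every layer returns q hidden states regardless of how long h^enc is.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    d = len(keys[0])
    outputs = []
    for qv in queries:
        scores = [sum(a * b for a, b in zip(qv, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def residual_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def decoder_layer(h_dec, h_enc):
    # Self-attention among the decoder hidden states.
    h = residual_add(h_dec, attention(h_dec, h_dec, h_dec))
    # Cross-attention from the decoder states to the encoder states h^enc.
    return residual_add(h, attention(h, h_enc, h_enc))


def decode(q_task, h_enc, num_layers):
    """Return the hidden states h^dec,l of every layer; each has length q."""
    states, h = [], q_task
    for _ in range(num_layers):
        h = decoder_layer(h, h_enc)
        states.append(h)
    return states


# Toy inputs (hypothetical): q = 3 queries, 5 encoder states, hidden size 2.
q_task = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
h_enc = [[float(i), 1.0] for i in range(5)]
states = decode(q_task, h_enc, num_layers=2)
print(len(states), len(states[0]), len(states[0][0]))
```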

4) Task-Specific Output Heads

Each task t has its own prediction head. For object detection, the output head predicts one bounding box for each of the q hidden states in the decoder output, along with a class prediction for each of those q boxes. Each position can predict either an object class or background. For datasets with attribute annotations on each box, such as Visual Genome, we also add an attribute classification output head.

The class, box, and attribute outputs c^l, b^l, and a^l at decoder layer l all have the same sequence length q as the detection query embedding q^task.
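
The per-position structure of the detection outputs can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, the attribute head is omitted, using the last class index for background and squashing box coordinates into (0, 1) with a sigmoid are assumptions borrowed from common DETR-style practice, and the weights are toy values rather than trained parameters.

```python
import math


def linear(vector, weight, bias):
    return [
        sum(w * x for w, x in zip(weight_row, vector)) + b
        for weight_row, b in zip(weight, bias)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def detection_head(decoder_states, class_weight, class_bias, box_weight, box_bias):
    """Predict one class distribution and one box for each of the q positions."""
    class_logits = [linear(h, class_weight, class_bias) for h in decoder_states]
    # Assumption: boxes are 4 normalized values, e.g. (cx, cy, w, h).
    boxes = [
        [sigmoid(v) for v in linear(h, box_weight, box_bias)]
        for h in decoder_states
    ]
    return class_logits, boxes


def predicted_labels(class_logits):
    """Index of the highest-scoring class at each position."""
    return [max(range(len(row)), key=row.__getitem__) for row in class_logits]


# Toy setup (hypothetical): q = 3 positions, hidden size 2, 2 object classes.
decoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
num_classes = 2
BACKGROUND = num_classes  # assumption: background is the extra last class
class_weight = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
class_bias = [0.0, 0.0, 0.0]
box_weight = [[0.5, 0.5]] * 4
box_bias = [0.0] * 4

c, b = detection_head(decoder_states, class_weight, class_bias, box_weight, box_bias)
labels = predicted_labels(c)
print(labels, len(b))
```

Both outputs have length q, matching the length of the query embedding sequence; positions whose top class is the background index would simply be discarded as "no object".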

All other tasks, such as natural language understanding, VQA, and visual entailment, can be modeled as classification among c_t classes for task t. For classification, we take the first position of the top decoder layer's output, h₁^{dec, top}, and pass it through a two-layer MLP with GELU activation. The class prediction p is trained using the cross-entropy loss against the ground-truth targets t.
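
The classification head and its loss can be sketched in a few lines. This is a minimal sketch with hypothetical names and toy weights, not the trained UniT head; it uses the exact GELU definition via the error function and the standard softmax cross-entropy.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vector, weight, bias):
    return [
        sum(w * x for w, x in zip(weight_row, vector)) + b
        for weight_row, b in zip(weight, bias)
    ]


def classify(decoder_top, w1, b1, w2, b2):
    """Two-layer MLP applied to the first position of the top decoder layer."""
    h1 = decoder_top[0]
    hidden = [gelu(v) for v in linear(h1, w1, b1)]
    return linear(hidden, w2, b2)  # logits over the c_t classes


def cross_entropy(logits, target):
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


# Toy setup (hypothetical): q = 2 positions, hidden size 2, c_t = 3 classes.
decoder_top = [[1.0, -1.0], [0.5, 0.5]]
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b2 = [0.0, 0.0, 0.0]

logits = classify(decoder_top, w1, b1, w2, b2)
loss = cross_entropy(logits, target=0)
print(len(logits), loss > 0.0)
```

Only the first decoder position feeds the classifier, so the remaining q - 1 positions do not affect the prediction for these tasks in this sketch.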

Experiments and Evaluation

Sample Results on Various Datasets

UniT was jointly trained on multiple tasks across several domains. It was trained and validated on object detection (COCO and Visual Genome (VG) datasets), natural language understanding (GLUE benchmark: QNLI, QQP, MNLI-mismatched, and SST-2), and joint vision-and-language tasks (VQAv2 and SNLI-VE datasets). VQAv2 pairs images with natural-language questions and answers, and SNLI-VE requires classifying whether an image entails, contradicts, or is neutral to a text description.

The above table shows the results of the analysis on object detection and VQA. "Shared" means the same decoder was used for all tasks, and "separate" means independent decoders were trained for each task. The shared decoders were initialized using weights from the model pre-trained on the COCO dataset. Joint training with a shared decoder performs well on COCO and VG and is beneficial for VQAv2. VQAv2 accuracy is highest for joint training with separate decoders. Overall, joint training helps both object detection and VQA.

The above table shows the results of the UniT model on all 8 benchmarks. The results are highly competitive with SOTA models such as BERT, DETR, and VisualBERT. Due to low cross-modality overlap, single-task UniT training beats current SOTA models on several tasks (VG, QNLI, MNLI, QQP). Note also that each single-task UniT model was trained for 500k iterations on its own dataset, while the multitask UniT was trained for a total of 500k iterations across all tasks. All hyperparameters were kept the same across all UniT training runs. The shared UniT model has 8x fewer parameters than the 8 single-task models combined, since one model performs all of the tasks with comparable accuracy.

Conclusion

The UniT model can perform 7 different tasks across 8 different datasets, with performance comparable to task-specific SOTA models. This brings us a step closer to an artificial general intelligence (AGI) system with the flexibility and abstraction capabilities of the human brain. Future work should aim to incorporate more capabilities, such as speech recognition, translation, game playing, and image generation, into the system to extend this quasi-AGI further. Recent work has already applied transformers to these tasks (even image generation, with TransGAN), so it is likely that they could also be incorporated into a unified transformer model. That would let the model cover a wider spectrum of the tasks that humans can do. For further information, please refer to the original paper.

Thapa Samrat: I am a second-year international student from Nepal, currently studying at the Department of Electronic and Information Engineering at Osaka University. I am interested in machine learning and deep learning, so I write articles about them in my spare time.