Transformer Is All You Need: They Can Do Anything!
3 main points
✔️ One transformer model for 7 different tasks across 8 different datasets in vision, NLP, and vision + NLP tasks.
✔️ Performance competitive with current SOTA models.
✔️ Parameter efficient compared to task-specific models.
Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer
written by Ronghang Hu, Amanpreet Singh
(Submitted on 22 Feb 2021)
Comments: Accepted to arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
code:
Introduction
Our brain is highly flexible in what it can do. It enables you to speak, write, listen, see, think, move, and much more. This versatility is also a desirable quality in artificial neural networks. Recently, transformers have been used for a wide variety of tasks in vision (object detection, instance segmentation), NLP (sentiment analysis, language modeling), vision + NLP (visual entailment, visual question answering), and more. Despite this, consolidating all of these tasks into one transformer model is a feat that had not been achieved successfully until now!

This paper describes a multimodal encoder-decoder transformer model named UniT, which can simultaneously handle 7 different tasks across different domains, from object detection to natural language understanding, visual question answering, and more.
Unified Transformer (UniT): One For All

UniT is composed of two different encoders: one for encoding images with a CNN backbone, and one for encoding text. UniT is based on the DETR model and has a joint decoder that takes task-specific query embeddings, followed by task-specific output heads. Each component is explained below:
1) Image Encoder
The image encoder is necessary for vision-only and vision-and-text tasks. It consists of a CNN backbone that extracts local features, followed by a multi-headed self-attention encoder that captures global contextual features. The CNN backbone is a ResNet-50 with dilation in the last C5 block, pre-trained for object detection. The feature map produced by the CNN backbone for the image I is flattened and passed on to the visual transformer encoder to generate a set of visual hidden states hv = {hv1, hv2, ..., hvL}. Here L = Hv × Wv, where Hv and Wv are the height and width of the feature map produced by the CNN backbone.
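To make the relation L = Hv × Wv concrete, here is a minimal sketch that computes the visual sequence length from an input image size. The function name and the default `stride` of 16 are assumptions for illustration (16 is a plausible total downsampling factor for a ResNet-50 whose last stage is dilated; an undilated ResNet-50 would use 32). The article itself does not state the stride.

```python
import math

def visual_sequence_length(image_h, image_w, stride=16):
    """Return (Hv, Wv, L) for an image of the given size.

    `stride` is the total downsampling factor of the CNN backbone.
    The default of 16 is an assumption, not a value from the article.
    """
    hv = math.ceil(image_h / stride)
    wv = math.ceil(image_w / stride)
    # Flattening the Hv x Wv feature map gives L feature vectors.
    return hv, wv, hv * wv

hv, wv, seq_len = visual_sequence_length(800, 1216)
print(hv, wv, seq_len)  # 50 76 3800
```

The sequence length grows with image area, which is why the stride of the backbone has a direct effect on the cost of self-attention in the visual encoder.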
Some tasks such as object detection and VQA require task-specific information to be extracted from the images. For this, we also use a task-embedding vector as follows:
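Written out with the symbols defined in this section, the image encoding presumably takes the following form. This is a reconstruction from the surrounding definitions; x^v (the CNN feature map of image I) and B (the CNN backbone) are assumed names.

```latex
x^{v} = B(I), \qquad
h^{v} = \{h^{v}_{1}, h^{v}_{2}, \dots, h^{v}_{L}\}
      = E_{v}\left(P_{b \to e}(x^{v}),\; w^{task}_{v}\right)
```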

Pb→e is a linear transformation that projects the feature map vectors to the hidden dimension of the transformer encoder. wvtask is the task embedding vector, which is concatenated to the projected feature map vectors. Ev is the visual transformer encoder.
2) Text Encoder
The text encoder is necessary for language tasks from the GLUE benchmark (such as QNLI, SST-2, and QQP) and for language-and-vision tasks like VQA. It is made up of BERT, a language model pre-trained using masked language modeling and next sentence prediction. Given a sequence of words, the words are tokenized into S tokens {w1, w2, ..., wS}, where the first token w1 is the special token [CLS] (used for classification in BERT). Similar to the image encoder, a task-specific embedding vector wttask is also appended to the sequence of tokens. Once BERT encodes the sequence, the encoding corresponding to the wttask vector is removed, as shown below.
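Following the description above, the text encoding can be sketched as below. This is a reconstruction from the text; the name h^t for the textual hidden states is an assumption chosen in parallel with hv for the image encoder.

```latex
h^{t} = \{h^{t}_{1}, h^{t}_{2}, \dots, h^{t}_{S}\}
      = \mathrm{BERT}\left(\{w_{1}, w_{2}, \dots, w_{S}\},\; w^{task}_{t}\right)
```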

It was also found that using only the encoding of the [CLS] token works very well, which also saves some computation.
3) Domain-agnostic Decoder
Unlike the encoders, the same decoder is shared across all modalities in the main model. For vision tasks, only encodings from the vision encoder are used, and for language tasks, only encodings from the text encoder are used. The two encodings are concatenated for language-and-vision tasks.
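The routing rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the task-type labels, and the order of concatenation (visual states first, then text) are hypothetical and not taken from the paper.

```python
def decoder_input(task_type, visual_states=None, text_states=None):
    """Pick the encoder hidden states that the shared decoder attends to.

    Each *_states argument is a list of hidden-state vectors (lists of
    floats). The task-type labels are illustrative assumptions.
    """
    if task_type == "vision":
        return list(visual_states)
    if task_type == "language":
        return list(text_states)
    if task_type == "vision+language":
        # Concatenate along the sequence axis (order is an assumption).
        return list(visual_states) + list(text_states)
    raise ValueError(f"unknown task type: {task_type!r}")

visual = [[0.0] * 4 for _ in range(6)]  # 6 visual hidden states
text = [[1.0] * 4]                      # 1 text hidden state
print(len(decoder_input("vision+language", visual, text)))  # 7
```

Because the decoder only sees a sequence of hidden states, it does not need to know which encoder produced them, which is what lets one decoder serve every task.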
The transformer decoder takes in the encodings henc from the encoders and a task-specific query embedding sequence qtask of length q. The l-th transformer decoder layer outputs a sequence of decoded hidden states hdec,l, also of length q. Within each layer, self-attention is applied among the decoder hidden states, and cross-attention is applied to the encoded input henc.
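In compact form, the decoder maps the encoded input and the task queries to one sequence of hidden states per layer. The symbol D for the decoder is an assumed name; the rest follows the notation of this section.

```latex
\{h^{dec,l}\} = D\left(h^{enc},\; q^{task}\right)
```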
4) Task-Specific Output Heads
Each task t has its own prediction head. For object detection, the output heads predict one bounding box for each of the q hidden states in the decoder output, along with a class prediction for each of those q boxes. Each position predicts either an object class or background. For some datasets, like the Visual Genome dataset with attribute annotations on each box, an attribute classification output head is also added.
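For task t and decoder layer l, the detection heads can be sketched as follows. C_t, B_t, and A_t are assumed names for the class, box, and attribute heads, and feeding the class output c^l into the attribute head is also an assumption that this article does not state.

```latex
c^{l} = C_{t}\left(h^{dec,l}\right), \qquad
b^{l} = B_{t}\left(h^{dec,l}\right), \qquad
a^{l} = A_{t}\left(h^{dec,l},\; c^{l}\right)
```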
The class, box, and attribute outputs cl, bl, and al all have the same sequence length q as the query embedding qtask for detection.
All other tasks, such as natural language understanding, VQA, and visual entailment, can be modeled as classification among ct classes for task t. For classification, the top-layer decoder hidden state at the first position, h1dec,top, is passed through a two-layer MLP with GeLU activation. The class prediction p is trained using the cross-entropy loss against the ground-truth target t.
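A minimal sketch of such a classification head, written with the standard library only, is given below. It assumes the form p = W1 · GeLU(W2 · h + b2) + b1; the weight names, the toy dimensions, and the use of the exact (erf-based) GeLU are assumptions for illustration, not details confirmed by the article.

```python
import math

def gelu(x):
    # Exact GeLU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(weights, bias, x):
    # weights is a list of rows; each row has len(x) entries.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def classify(h, w2, b2, w1, b1):
    """Two-layer MLP head: logits = W1 * GeLU(W2 * h + b2) + b1."""
    hidden = [gelu(z) for z in linear(w2, b2, h)]
    return linear(w1, b1, hidden)

def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Toy usage: a 2-dimensional hidden state and 3 classes (made-up weights).
h = [1.0, -1.0]
logits = classify(h,
                  w2=[[1.0, 0.0], [0.0, 1.0]], b2=[0.0, 0.0],
                  w1=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], b1=[0.0, 0.0, 0.0])
loss = cross_entropy(logits, target=0)
print(len(logits), loss > 0.0)
```

Only this small head differs between classification tasks; the hidden state it consumes comes from the shared decoder.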
Experiments and Evaluation

Sample Results on Various Datasets
UniT was jointly trained on multiple tasks across several domains. It was trained and evaluated on object detection (the COCO and Visual Genome (VG) datasets), natural language understanding (GLUE benchmark: QNLI, QQP, MNLI-mismatched, and SST-2), and joint vision-and-language tasks (VQAv2 and SNLI-VE). VQAv2 is a visual question answering dataset, here supplemented with questions from Visual Genome, and SNLI-VE requires classifying whether the image entails, contradicts, or is neutral to the text description.

The above table shows the results of the analysis on object detection and VQA. "Shared" means the same decoder was used for all tasks, and "separate" means independent decoders were trained. The shared decoders were initialized using weights from the model pre-trained on the COCO dataset. Joint training with a shared decoder performs well on COCO and VG and is beneficial for VQAv2. VQAv2 accuracy is highest for joint training with separate decoders. Overall, joint training helps both object detection and VQA.

The above table shows the results of the UniT model on all 8 benchmarks. The results are highly competitive with SOTA models: BERT, DETR, and VisualBERT. Due to low cross-modality overlap, UniT with single-task training beats current SOTA models on several tasks (VG, QNLI, MNLI, QQP). It is also worth noting that each single-task UniT model was trained for 500k iterations on its own dataset, while the multitask UniT was trained for a total of 500k iterations across all tasks. All hyperparameters were kept the same across all UniT training runs. The shared UniT model has 8x fewer parameters than the 8 single-task models combined, since one model covers all 8 benchmarks with comparable accuracy.
Conclusion
The UniT model can perform 7 different tasks across 8 different datasets, with performance comparable to task-specific SOTA models. This brings us a step closer to modeling an artificial general intelligence (AGI) system with the flexibility and abstraction capabilities of the human brain. Future research should aim to incorporate more modalities and tasks, such as speech recognition, translation, game playing, and image generation, into the system and extend this quasi-AGI even further. Recent work has enabled transformers to be used for these tasks as well (even generating images through TransGAN), so it is likely that they could also be incorporated into the unified transformer model. That would make the model capable of covering a wider spectrum of the tasks that humans can do. For further information, please refer to the original paper.