Multimodal End-to-end Transformer


3 main points
✔️ Framework for considering how to train ViT-based VLP models end-to-end
✔️ Consider decomposing model design into four components
✔️ Pre-training with 4M images achieves performance comparable to state-of-the-art models

An Empirical Study of Training End-to-End Vision-and-Language Transformers
written by Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, Michael Zeng
(Submitted on 3 Nov 2021 (v1), last revised 18 Mar 2022 (this version, v3))
Comments: CVPR2022.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Vision-and-language pre-training (VLP), which learns multimodal representations from a large number of image-caption pairs, has shown excellent performance on vision-language (VL) tasks such as visual question answering (VQA) and image-text retrieval.

A natural question is whether ViT (Vision Transformer), which has shown promising results in computer vision, can be used for VLP.

The paper presented in this article proposes the Multimodal End-to-end TransformER (METER) framework and investigates how to pre-train a fully Transformer-based VLP model end-to-end.

The authors analyze multiple elements of model design to find what yields better performance, and by combining the findings they achieve an accuracy of 77.64% on the VQAv2 test-std set, outperforming existing state-of-the-art methods.

About the VLP Model

First, we describe the three typical types of vision-and-language pre-training (VLP) models.

This table summarizes the encoders (Vision Encoder and Text Encoder) used for visual and text feature extraction, the Multimodal Fusion used to fuse these features, the Decoder, and the pre-training objectives.

Object detection (OD)-based region features

In many previous studies, pre-trained object detection models are used to extract visual features ("OD" under Vision Encoder in the table).

The drawbacks of this approach are that the extraction of region features is time-consuming and the pre-trained ODs are frozen during pre-training, which limits the capacity of the VLP model.

CNN-based grid features

To overcome the shortcomings of OD-based methods and to enable end-to-end pre-training, models such as PixelBERT and CLIP-ViL feed CNN grid features and text directly into the Transformer.

Using grid features directly is efficient, but different optimizers are usually needed for the CNN and the Transformer. For example, PixelBERT and CLIP-ViL use AdamW for the Transformer and SGD for the CNN.

ViT-based patch features

Among models that use ViT-based features, ViLT feeds image patch features and text token embeddings directly into a ViT model. Visual Parsing and ALBEF also use ViT as the image encoder.

However, these models lag behind state-of-the-art performance on downstream tasks such as VQA. The paper presented in this article investigates how to pre-train a ViT-based model end-to-end so that it achieves high performance while maintaining fast inference speed.

METER Framework

Of the three types of VLP models described above, the paper focuses on the one that uses ViT-based patch features.

For this purpose, the paper proposes the METER framework as shown in the figure below.

Overall, given a text sentence $l$ and an image $v$, the VLP model first extracts textual features $l=\langle l_1, ... , l_N \rangle$ and visual features $v=\langle v_1, ... , v_M \rangle$ with the text and visual encoders, respectively.

These are passed through a multimodal fusion module to produce a cross-modal representation and optionally further passed through a decoder to produce the final output. Based on this framework, various analyses are performed to obtain a good end-to-end ViT-based VLP model.
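The data flow described above can be sketched as a simple function composition. This is only an illustrative sketch: the function name `meter_forward` and the toy encoders, fusion function, and decoder are hypothetical stand-ins, not the paper's implementation.

```python
from typing import Any, Callable, List, Optional

Features = List[List[float]]  # a sequence of feature vectors


def meter_forward(
    text: Any,
    image: Any,
    text_encoder: Callable[[Any], Features],
    visual_encoder: Callable[[Any], Features],
    fusion: Callable[[Features, Features], Features],
    decoder: Optional[Callable[[Features], Any]] = None,
) -> Any:
    """Encode each modality, fuse, and optionally decode."""
    l_feats = text_encoder(text)      # <l_1, ..., l_N>
    v_feats = visual_encoder(image)   # <v_1, ..., v_M>
    fused = fusion(l_feats, v_feats)  # cross-modal representation
    # The decoder is optional: encoder-only models return the fused features.
    return decoder(fused) if decoder is not None else fused


# Hypothetical toy components, used only to make the sketch runnable.
def toy_text_encoder(sentence: str) -> Features:
    return [[float(len(word))] for word in sentence.split()]


def toy_visual_encoder(pixels: List[int]) -> Features:
    return [[float(p)] for p in pixels]


def toy_fusion(l_feats: Features, v_feats: Features) -> Features:
    return l_feats + v_feats  # simply concatenates the two sequences


encoder_only_out = meter_forward(
    "a cat", [1, 2, 3], toy_text_encoder, toy_visual_encoder, toy_fusion
)
with_decoder_out = meter_forward(
    "a cat", [1, 2, 3], toy_text_encoder, toy_visual_encoder, toy_fusion, decoder=len
)
```

Keeping the four components as interchangeable callables mirrors how the framework lets each design choice be varied independently.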

Model architecture

The design of the model architecture involves four major components: the visual encoder, the text encoder, the multimodal fusion module, and whether to use a decoder.

Visual encoder

When using ViT-based patch features, the image is split into patches (16x16), which are input to the Transformer model.
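The patch splitting can be illustrated with a minimal sketch. It assumes a single-channel 224x224 image stored as nested lists and non-overlapping 16x16 patches; the function name `split_into_patches` is hypothetical, and real ViT implementations additionally apply a learned linear projection and position embeddings to each flattened patch.

```python
from typing import List

Image = List[List[float]]  # H x W, single channel for simplicity


def split_into_patches(image: Image, patch_size: int = 16) -> List[List[float]]:
    """Split an image into non-overlapping, flattened square patches (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy image whose pixel value encodes its position (row * 224 + column).
image = [[float(r * 224 + c) for c in range(224)] for r in range(224)]
patches = split_into_patches(image)
num_patches = len(patches)    # (224 / 16) ** 2 = 196
patch_dim = len(patches[0])   # 16 * 16 = 256 values per patch
```

With a 224x224 input and 16x16 patches, the visual encoder therefore receives a sequence of 196 patch tokens.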

However, as mentioned earlier, existing VLP models based on ViT-based methods perform poorly compared to the state-of-the-art, and it is not known which of the various pre-trained ViT variants is the best model for VLP.

In the paper, the following ViT-based models are analyzed as visual encoders:

  • ViT
  • DeiT
  • Distilled-DeiT
  • CaiT
  • VOLO
  • BEiT
  • Swin Transformer
  • CLIP-ViT

Text encoder

The VLP model generates input text sequences by first splitting input sentences into subword sequences and then inserting special tokens at the beginning and end of the sentence.
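As a rough illustration of this input construction, the sketch below uses a greedy longest-match subword split in the style of WordPiece together with BERT-style `[CLS]` and `[SEP]` tokens. The vocabulary `TOY_VOCAB` and the helper names are hypothetical; real tokenizers use learned vocabularies, and RoBERTa uses byte-level BPE with `<s>` and `</s>` instead.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first subword split; non-initial pieces carry '##'."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known subword covers this position
        pieces.append(piece)
        start = end
    return pieces


def build_input(sentence: str, vocab: Set[str]) -> List[str]:
    """Split into subwords and wrap the sequence in special tokens."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"a", "dog", "play", "##ing", "fetch"}
tokens = build_input("A dog playing fetch", TOY_VOCAB)
# tokens == ['[CLS]', 'a', 'dog', 'play', '##ing', 'fetch', '[SEP]']
```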

The paper uses the following pre-trained language models as text encoders:

  • BERT
  • RoBERTa
  • DeBERTa

The paper also experiments with using only a simple word embedding layer, initialized from the BERT embedding layer, in place of a full text encoder.

Multimodal Fusion Module

For fusing visual and textual features, the paper considers two types of fusion modules: co-attention and merged attention.

The co-attention model feeds the textual and visual features into separate Transformer blocks and lets them interact through cross-attention, while the merged attention model simply concatenates the two feature sequences and feeds them into a single Transformer block.
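The difference between the two fusion styles can be sketched with a bare-bones single-head attention. This is a simplified illustration under strong assumptions: there are no learned projections, feed-forward blocks, residual connections, or layer normalization, so the separate per-modality parameters of co-attention are not represented. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # sequence of feature vectors


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys_values: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        )
    return outputs


def merged_attention(text: Matrix, image: Matrix) -> Matrix:
    """Concatenate both modalities and run one self-attention over the result."""
    merged = text + image
    return attention(merged, merged)


def co_attention(text: Matrix, image: Matrix) -> Tuple[Matrix, Matrix]:
    """Per-modality self-attention followed by cross-attention to the other modality."""
    text_sa = attention(text, text)
    image_sa = attention(image, image)
    text_out = attention(text_sa, image_sa)   # text queries attend to image features
    image_out = attention(image_sa, text_sa)  # image queries attend to text features
    return text_out, image_out


text_feats = [[1.0, 0.0], [0.0, 1.0]]                 # N = 2 text tokens
image_feats = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]   # M = 3 image patches
merged_out = merged_attention(text_feats, image_feats)        # N + M vectors
text_out, image_out = co_attention(text_feats, image_feats)   # N and M vectors
```

Merged attention yields one joint sequence of length N + M, whereas co-attention keeps two streams of lengths N and M that exchange information only through cross-attention.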

Encoder only or encoder/decoder

Existing VLP models come in two variants: with and without a decoder before the final output layer. The experiments consider both cases, as shown in the figure below.

Pre-training Objectives

Next, we consider the model's pre-training objectives.

Masked Language Modeling (MLM)

In MLM for VLP, given an image-caption pair, some of the input tokens are first randomly masked.

The model is then trained to recover the original input tokens $l$ from the masked token sequence $l^{mask}$ and the image $v$.
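The masking step can be sketched as follows. The 15% masking probability is an assumption borrowed from the common BERT convention (the text above does not state a rate), the function name `mask_tokens` is hypothetical, and BERT's additional random-replacement and keep-as-is cases are omitted for brevity.

```python
import random
from typing import Dict, List, Optional, Tuple

SPECIAL_TOKENS = ("[CLS]", "[SEP]")


def mask_tokens(
    tokens: List[str],
    mask_prob: float = 0.15,  # assumed rate, following the usual BERT convention
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], Dict[int, str]]:
    """Return the masked sequence l^mask and the original tokens to recover."""
    if rng is None:
        rng = random.Random(0)
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for i, token in enumerate(tokens):
        if token in SPECIAL_TOKENS:
            continue  # special tokens are never masked
        if rng.random() < mask_prob:
            masked[i] = "[MASK]"
            targets[i] = token  # the model must predict this from l^mask and v
    return masked, targets


caption = ["[CLS]", "a", "dog", "play", "##ing", "fetch", "[SEP]"]
masked_caption, targets = mask_tokens(caption, rng=random.Random(42))
```

During training, the loss is the cross-entropy between the model's prediction at each masked position and the corresponding entry in `targets`, with the image features available as context.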

Image-Text Matching (ITM)

In image-text matching, given an image-caption pair, the model is trained on a binary classification problem: deciding whether the caption corresponds to the image.
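A minimal sketch of this binary classification setup is given below. The negative-sampling strategy (pairing each image with a caption drawn from another pair) is an assumption made for illustration, as are the function names; the source does not specify how mismatched pairs are built.

```python
import math
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (image id, caption)


def make_itm_examples(pairs: List[Pair], rng: random.Random) -> List[Tuple[str, str, int]]:
    """Build matched (label 1) and mismatched (label 0) image-caption examples."""
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to sample a mismatched caption")
    examples = []
    for i, (image, caption) in enumerate(pairs):
        examples.append((image, caption, 1))
        # Assumed strategy: take the caption of a different, randomly chosen pair.
        j = rng.choice([k for k in range(len(pairs)) if k != i])
        examples.append((image, pairs[j][1], 0))
    return examples


def binary_cross_entropy(prob_match: float, label: int) -> float:
    """ITM loss for one example, given the predicted match probability."""
    eps = 1e-12
    p = min(max(prob_match, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


pairs = [("img0", "cap0"), ("img1", "cap1"), ("img2", "cap2")]
examples = make_itm_examples(pairs, random.Random(0))
uncertain_loss = binary_cross_entropy(0.5, 1)   # log(2), about 0.693
confident_loss = binary_cross_entropy(0.99, 1)  # close to 0
```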

Masked Image Modeling (MIM)

Analogous to MLM on the text side, MIM masks part of the image and trains the model to recover the region features of the masked part. Instead of regressing region features, the model may also be trained to predict the object labels of the masked regions.

However, it is questionable whether MIM is effective for VLP, since recent state-of-the-art VLP models do not apply it.

To study this further, the paper treats MIM as a patch classification task and analyzes two implementations:

Masked Patch Classification with In-batch Negatives

In the first implementation, the model recovers masked input patches using a dynamic vocabulary constructed from the in-batch negatives.

Specifically, suppose that a batch $\{\langle v^k, l^k \rangle\}^B_{k=1}$ of image-caption pairs is sampled at each training step ($B$ is the batch size). All image patches in $\{v^k\}^B_{k=1}$ are treated as the candidate set, and for each randomly masked patch the model must select the original patch from this candidate set.
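The selection over in-batch candidates can be sketched as a softmax classification. Scoring candidates by a dot product with the model's predicted vector is an assumption made here for illustration, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def gather_candidates(batch_patches: List[List[Vector]]) -> List[Vector]:
    """Flatten the patches of all B images in the batch into one candidate set."""
    return [patch for image_patches in batch_patches for patch in image_patches]


def in_batch_patch_loss(predicted: Vector, candidates: List[Vector], target_index: int) -> float:
    """Cross-entropy of selecting the original patch among all in-batch candidates."""
    # Assumed scoring function: dot product between prediction and candidate.
    scores = [sum(a * b for a, b in zip(predicted, c)) for c in candidates]
    m = max(scores)
    log_normalizer = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_normalizer - scores[target_index]


# Toy batch: B = 2 images with 2 patches each, giving 4 candidates.
batch_patches = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
candidates = gather_candidates(batch_patches)

good_loss = in_batch_patch_loss([5.0, 0.0], candidates, target_index=0)     # small
bad_loss = in_batch_patch_loss([5.0, 0.0], candidates, target_index=2)      # large
uniform_loss = in_batch_patch_loss([0.0, 0.0], candidates, target_index=0)  # log(4)
```

Because the candidate set is rebuilt from each batch, the "vocabulary" over which the model classifies changes at every training step.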

Masked Patch Classification with Discrete Code

In the second implementation, the model is trained to recover a discrete representation of the input patches. Specifically, each image is converted into a sequence of discrete tokens using the VQ-VAE from DALL-E, with the image resized so that the number of patches matches the number of tokens.

The model then predicts the discrete tokens corresponding to the randomly masked patches.
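The resizing step is simple arithmetic, sketched below. The tokenizer stride of 8 (one discrete token per 8x8 pixel block) is an assumption based on the publicly released DALL-E tokenizer, and the 224x224 input with 16x16 patches is used only as an example; the function names are hypothetical.

```python
def patch_grid(image_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square image."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def tokenizer_input_size(image_size: int, patch_size: int, tokenizer_stride: int) -> int:
    """Side length to resize to so the tokenizer emits one token per patch."""
    return patch_grid(image_size, patch_size) * tokenizer_stride


TOKENIZER_STRIDE = 8  # assumption: the tokenizer downsamples each side by 8

grid = patch_grid(224, 16)                                     # 14 patches per side
num_patches = grid * grid                                      # 196 patches
resized_side = tokenizer_input_size(224, 16, TOKENIZER_STRIDE) # 112 pixels per side
num_tokens = (resized_side // TOKENIZER_STRIDE) ** 2           # 196 tokens
```

Under these assumptions, the copy of the image fed to the tokenizer is resized to 112x112 so that each of the 196 patches lines up with exactly one discrete token.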

Default METER settings

Unless otherwise mentioned, the default settings of METER in the experiments are as follows:

  • For the model architecture, the multimodal fusion encoder consists of six Transformer encoder layers, where each layer consists of a self-attention block, a cross-attention block, and a feed-forward block. The hidden size of the top layers is 768 and the number of attention heads is 12.
  • Two pre-training objectives are used: MLM and ITM.
  • Four pre-training datasets are used: COCO, Conceptual Captions, SBU Captions, and Visual Genome.
  • For the downstream tasks, the focus is mainly on VQAv2. NLVR2, SNLI-VE, COCO, and Flickr30k are also evaluated for comparison.
  • For pre-training, the model is trained with AdamW for 100k steps.
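For reference, the defaults listed above can be collected into a small configuration object. The class and field names below are hypothetical and chosen only for illustration; the values are the ones stated in the list.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MeterDefaults:
    """Default experimental settings as summarized in the list above."""

    num_top_layers: int = 6      # six Transformer encoder layers
    hidden_size: int = 768
    num_heads: int = 12
    objectives: Tuple[str, ...] = ("mlm", "itm")
    pretrain_datasets: Tuple[str, ...] = (
        "COCO",
        "Conceptual Captions",
        "SBU Captions",
        "Visual Genome",
    )
    main_downstream_task: str = "VQAv2"
    optimizer: str = "AdamW"
    pretrain_steps: int = 100_000

    @property
    def head_dim(self) -> int:
        """Per-head dimension implied by the hidden size and head count."""
        return self.hidden_size // self.num_heads


defaults = MeterDefaults()
head_dim = defaults.head_dim  # 768 / 12 = 64
```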


About visual and text encoders

First, we look at the impact of the visual and text encoders. Because pre-training is expensive, this part of the study is conducted without VLP. Specifically, the lower layers are initialized with pre-trained visual and text encoders, the upper layers are initialized randomly, and the model is fine-tuned directly on the downstream task.

The results for the visual and text encoders are as follows.

When the model was optimized directly without VLP, the Swin Transformer and CLIP-ViT were particularly useful for visual encoders.

On the other hand, there was no significant difference between the text encoders, although RoBERTa appeared to be the most robust. The performance of the Emb-only setting decreased, suggesting that it is important to use a pre-trained text encoder. Based on these findings, the results with VLP are as follows.

In particular, CLIP-ViT-224/16 achieves VQA scores of 77.19/77.20 for the test-dev/test-std sets, respectively, outperforming the existing state-of-the-art method, VinVL.

About Multimodal Fusion Module and Decoder

The comparison results for the aforementioned multimodal fusion module and decoder are as follows.

Experimental results show that the Co-attention model performs better, suggesting that it is important to have different parameter sets for each of the two modalities.

The encoder-only model without a decoder also shows better results. However, it should be noted that the encoder-decoder model has some advantages, such as the flexibility to perform tasks such as image captioning, which are difficult to apply in the encoder-only model.

About Pre-training Objectives

The following table shows how performance changes with the pre-training objectives.

Experimental results show that MLM and ITM improve the downstream task performance, while MIM degrades the performance. This result indicates that the findings of region-based VLP methods may not be valid for ViT-based methods.

In addition, this performance degradation appears to be caused by conflicts between the different objectives, which might be alleviated by techniques such as multi-task optimization.

Comparison with existing methods

Finally, the best-performing models in the experiments so far (RoBERTa-base with Swin Transformer or CLIP-ViT-224/16) are compared with existing methods. The results are as follows.

Overall, among the proposed models, the CLIP-based model achieved the best or second-best performance on all downstream tasks when compared with models pre-trained on fewer than 10M images.

The results of pre-training with more images (14M images and 20M image-caption pairs) and larger backbones (CoSwin-Huge, RoBERTa-base) are also shown below.

Experimental results show that the proposed method outperforms the existing methods trained on 1.8B images, indicating that the proposed method is scalable.


Summary

This article introduced a paper on VLP models for vision-and-language multimodal tasks, which investigates how to pre-train a fully Transformer-based VLP model end-to-end.

Comprehensive experiments revealed insights into effective model design for ViT-based VLP models, and the resulting model achieves performance comparable to state-of-the-art methods with pre-training on only 4M images. This work opens a new path for ViT-based vision-and-language pre-training methods.
