A Model That Can Predict The Trajectory Of The Eye From The Input Image!
3 main points
✔️ Proposed two novel tasks: generating traces with image and caption as input, and generating captions and traces with image only as input
✔️ Proposed MIrrored TransformeR (MITR), a transformer architecture for jointly learning images, captions, and traces
✔️ Experiments on four existing datasets demonstrate the effectiveness of our approach
Connecting What to Say With Where to Look by Modeling Human Attention Traces
written by Zihang Meng, Licheng Yu, Ning Zhang, Tamara Berg, Babak Damavandi, Vikas Singh, Amy Bearman
(Submitted on 12 May 2021)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In the past, there have been only occasional overlaps in the development of models and algorithms in the fields of computer vision and natural language processing, but in recent years there has been a gradual convergence of ideas in these two fields.
In particular, the focus is on building multimodal models to align vision and language, and the goal of these models is to mimic the extraordinary ability of humans to compress and translate information across modalities.
However, despite these advances, existing image caption datasets provide only short noun- or phrase-level captions, and previous image captioning and visual grounding models are unable to jointly generate long natural-language captions with highly accurate word-level visual grounding.
This article describes a paper that addresses this problem with a novel Transformer architecture that jointly models the three modalities of image, caption, and trace.
Overview of datasets and new tasks
First, we describe the Localized Narratives dataset used in this paper and the novel tasks built on it.
The Localized Narratives dataset was collected by simultaneously recording the annotator's voice and mouse traces as he or she described the content of an image. The dataset consists of three modalities: image, caption, and trace.
While the original paper on this dataset dealt only with the single task of generating captions from images and traces, this paper proposes two additional novel and challenging tasks:
- Generating traces with image and caption as input
- Generating captions and traces with only the image as input
This is shown in the figure below. (Rows 1 and 3 in the table are new tasks)
Although these three tasks may appear to be separate at first glance, this paper proposes a unified framework that models them jointly using a novel model architecture.
MIrrored TransformeR (MITR)
In this paper, instead of building three separate models for the above three tasks, we propose a model that learns effectively in a unified framework with shared parameters, and name this architecture MIrrored TransformeR (MITR) due to its symmetric structure (see the figure below).
Input Features
The input to the model is a subset of image features, text features, and trace features, each of which is computed as follows (a minimal sketch of the feature construction is given after the list):
- For the image features, a pre-trained Faster R-CNN is used to compute visual features for the detected regions
- For the text features, as in existing studies, we sum positional embeddings and word embeddings
- For the trace features, we project the input trace onto the d hidden dimensions and add positional embeddings
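The sketch below illustrates these three feature streams in PyTorch-style pseudocode. The module name, all dimensions, the box format of the trace segments, and the use of a shared learned positional embedding table are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class InputFeatures(nn.Module):
    """Hypothetical sketch of the three input feature streams (dims are assumptions)."""
    def __init__(self, vocab_size, d_model=512, box_dim=4, max_len=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)   # word embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)        # learned positional embeddings (assumed shared)
        self.trace_proj = nn.Linear(box_dim, d_model)        # projection of trace boxes to d hidden dims

    def forward(self, region_feats, token_ids, trace_boxes):
        # region_feats: (B, R, d_model) pre-extracted Faster R-CNN region features
        x_v = region_feats
        pos_w = torch.arange(token_ids.size(1), device=token_ids.device)
        x_w = self.word_emb(token_ids) + self.pos_emb(pos_w)       # text: word + positional embeddings
        pos_r = torch.arange(trace_boxes.size(1), device=trace_boxes.device)
        x_r = self.trace_proj(trace_boxes) + self.pos_emb(pos_r)   # trace: projection + positional embeddings
        return x_v, x_w, x_r
```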
Model Architecture
This model consists of three modules: (1) an image encoder, (2) a caption encoder-decoder, and (3) a trace encoder-decoder (see the figure below).
Let the input image features, text features, and trace features be denoted as xv, xw, and xr, respectively. The image encoder hv is defined as follows:
Here, following existing studies, we define the feed-forward network (FFN) as two linear transformation layers with a ReLU activation function in between and define MultiHead as follows.
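As a concrete reference, the block below is a minimal sketch of one such image-encoder layer built from these standard Transformer components: multi-head self-attention followed by an FFN with two linear layers and a ReLU in between. The residual/layer-norm placement and all dimensions are assumptions, not the paper's exact formulation.

```python
import torch.nn as nn

class ImageEncoderBlock(nn.Module):
    """Sketch of one image-encoder layer: multi-head self-attention + FFN (Linear-ReLU-Linear)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x_v):
        # self-attention over the image region features x_v
        a, _ = self.attn(x_v, x_v, x_v)
        h = self.norm1(x_v + a)
        return self.norm2(h + self.ffn(h))   # h_v
```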
Also, the caption encoder-decoder hw and the trace encoder-decoder hr are defined as follows:
These modules are designed with a mirrored structure so that the two modalities play symmetric roles in the caption-generation and trace-generation tasks.
Also, by applying the masking operation proposed in existing research, in which an encoder attends to the entire input while a decoder attends only to past positions, these two modules can seamlessly switch between the roles of encoder and decoder.
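A minimal sketch of such a mask is shown below: the same module acts as an encoder when given a full attention pattern and as a decoder when given a causal mask that hides future positions. The function name and the boolean convention (True means "do not attend", as expected by PyTorch's nn.MultiheadAttention) are implementation choices, not details from the paper.

```python
import torch

def attention_mask(seq_len, as_decoder, device=None):
    """Build the mask that switches a module between encoder and decoder roles."""
    if not as_decoder:
        # encoder role: every position may attend to the entire input (no masking)
        return torch.zeros(seq_len, seq_len, dtype=torch.bool, device=device)
    # decoder role: causal mask; True marks future positions that must not be attended to
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device), diagonal=1)
```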
Total Loss Function
The final loss function can be formulated as
where L_trace is the L1 loss between the predicted and ground-truth trace boxes in trace generation, L_caption is the cross-entropy loss on the caption in caption generation, L_{r̃→ŵ→r̂} is the cycle loss (generating a caption from the trace and then regenerating the trace from that caption), and L_joint is the sum of the trace loss and the caption loss in the joint caption-and-trace generation task.
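A rough sketch of how these terms could be combined is given below. The function signature, the loss weights, the use of L1 for the cycle term, and the assumption that the joint-task loss is computed upstream and passed in are all illustrative choices, not the paper's exact implementation.

```python
import torch.nn.functional as F

def total_loss(pred_boxes, gt_boxes, cap_logits, cap_targets,
               cycle_boxes, joint_loss, w=(1.0, 1.0, 1.0, 1.0)):
    """Sketch of the combined objective; weights and the L1 cycle term are assumptions."""
    l_trace = F.l1_loss(pred_boxes, gt_boxes)                          # L_trace: L1 on trace boxes
    l_cap = F.cross_entropy(cap_logits.transpose(1, 2), cap_targets)   # L_caption: token-level cross-entropy
    l_cycle = F.l1_loss(cycle_boxes, gt_boxes)                         # cycle loss: trace -> caption -> trace
    # joint_loss: sum of trace and caption losses from the joint generation task (computed elsewhere)
    return w[0] * l_trace + w[1] * l_cap + w[2] * l_cycle + w[3] * joint_loss
```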
Experiments
In this paper, we experimented with four datasets: COCO, Flickr30k, ADE20k, and Open Images.
Trace & Caption Generation
The figure below shows the results of trace generation (Task 1) and caption generation (Task 2) using the method proposed in this paper.
As shown in the figure, the proposed method can obtain accurate generation results for both tasks.
Joint Caption and Trace Generation
The result of simultaneous caption and trace generation (Task 3) is shown in the figure below.
Modeling traces simultaneously with captions resulted in a significant improvement in caption generation performance compared to the baseline where only captions were modeled.
However, when human trace annotations are not available for caption generation, we sometimes observed defects such as the same object or description being repeated several times within a single caption; in future development, measures such as keeping a record of all referenced objects should be taken to avoid such repetition.
Summary
How was it? In this article, we described a paper that proposed MIrrored TransformeR (MITR), a novel Transformer architecture that jointly models the three modalities of image, caption, and trace.
Since this model has the potential to help solve various social problems, such as automatically generating localized descriptions of images for visually impaired people on social media, we will keep an eye on future developments.
If you are interested, details of the architecture and generated samples from the model introduced in this article can be found in the paper.