
GRIT, An Image Caption Generation Model That Integrates Two Visual Features And Achieves Significant Accuracy Improvements, Is Now Available!



3 main points
✔️ Integrating two kinds of visual features, grid features and region features, achieves significantly better performance than existing image caption generation methods
✔️ Replacing the CNN-based detector of existing methods with a DETR-based detector achieves a computational speedup
✔️ The Transformer-only model structure enables end-to-end learning

GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features
written by Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani
(Submitted on 20 Jul 2022)
ECCV 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Image caption generation is the task of generating a natural-language description of a scene given an image of that scene. It requires both a comprehensive understanding of the scene and a description that accurately reflects that understanding.

The most important problem in this task is how to extract good features from images, and existing research has taken two main approaches:

  • Grid features: local image features extracted at regular grid points
  • Region features: local image features of the bounding boxes detected by an object detector
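To make the distinction concrete, here is a minimal NumPy sketch of the two feature types; average pooling inside each box is a crude stand-in for the actual ROI feature extraction, and the shapes are illustrative, not taken from the paper:

```python
import numpy as np

def grid_features(feature_map):
    """Flatten an (H, W, D) backbone feature map into H*W grid tokens."""
    h, w, d = feature_map.shape
    return feature_map.reshape(h * w, d)

def region_features(feature_map, boxes):
    """Average-pool the feature map inside each detected box, a crude
    stand-in for ROI feature extraction; boxes are (x0, y0, x1, y1)
    in grid coordinates."""
    pooled = []
    for x0, y0, x1, y1 in boxes:
        pooled.append(feature_map[y0:y1, x0:x1].mean(axis=(0, 1)))
    return np.stack(pooled)

fmap = np.random.rand(7, 7, 256)           # backbone output for one image
g = grid_features(fmap)                    # 49 grid tokens, one per cell
r = region_features(fmap, [(0, 0, 3, 3), (2, 2, 7, 7)])  # 2 region tokens
print(g.shape, r.shape)                    # (49, 256) (2, 256)
```

Grid features cover every cell of the image uniformly, while region features exist only where the detector has placed a box, which is exactly the trade-off discussed below.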

Current state-of-the-art methods for image caption generation use region features to directly encode the detected object regions, but this approach has several drawbacks:

  1. It cannot cover the areas between objects, so contextual information such as the relationships between objects is lost.
  2. It risks detection errors, such as missing important objects in the image.
  3. Its computational cost is enormous, especially when a high-performance CNN-based detector such as Faster R-CNN is used.

Grid features, on the other hand, are extracted from the entire image, so they capture contextual information such as the relationships between objects and could resolve the problems above. Studies have therefore attempted to integrate the two kinds of features, but the best way to do so has remained unknown.

GRIT (Grid- and Region-based Image captioning Transformer), introduced in this paper, is a Transformer-only architecture that integrates these two visual features. As shown in the figure below, this end-to-end model achieves significant improvements in both computational speed and performance over existing methods.

GRIT: Grid- and Region-based Image captioning Transformer

GRIT consists of two mechanisms: one extracts the two kinds of visual features from an input image, and the other generates a caption from the extracted features (see the figure below).

Feature Extractor

Like conventional image captioning methods, GRIT uses an object detector to extract region features. However, instead of a CNN-based detector such as Faster R-CNN, which is used in conventional SOTA image captioning models, it adopts DETR, a Transformer-based detection framework.

This enables end-to-end training of the entire model, from the input image to the final generated caption, and significantly reduces computation time while maintaining captioning performance on par with SOTA models.

Specifically, the detector is pre-trained on object detection following the training procedure of Deformable DETR, a variant of DETR. It is then fine-tuned on a combined object detection and object attribute prediction task using the following loss function.

where p̂σ̂(i)(ai) is the attribute probability, p̂σ̂(i)(ci) is the class probability, and Lbox(bi, b̂σ̂(i)) is the normalized bounding-box regression loss for object i.
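The loss itself appears only as an image in the original article. Reconstructed from the term descriptions above, a DETR-style fine-tuning loss with an added attribute term plausibly takes the following form (a sketch; the paper's exact matching and weighting coefficients may differ):

```latex
\mathcal{L} = \sum_{i=1}^{N}\Big[
    -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
    -\log \hat{p}_{\hat{\sigma}(i)}(a_i)
    + \mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big)
\Big]
```

Here σ̂ is the bipartite matching between ground-truth objects and predicted queries, as in DETR.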

Caption Generator

The caption generator follows the basic Transformer architecture adopted in previous studies and takes the two types of visual features, region features and grid features, as input.
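One simple way a decoder can consume both feature sets is to concatenate them into a single sequence of key/value tokens for cross-attention. The sketch below assumes this plain concatenation; GRIT's actual decoder design follows the paper and may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, grid_feats, region_feats):
    """Single-head cross-attention from word positions to the union of
    grid and region tokens (a hypothetical fusion scheme, not the
    paper's exact design)."""
    kv = np.concatenate([grid_feats, region_feats], axis=0)  # (Ng+Nr, D)
    scores = queries @ kv.T / np.sqrt(queries.shape[-1])     # (T, Ng+Nr)
    return softmax(scores, axis=-1) @ kv                     # (T, D)

q = np.random.rand(5, 64)                       # 5 word positions
out = cross_attend(q, np.random.rand(49, 64),   # 49 grid tokens
                   np.random.rand(8, 64))       # 8 region tokens
print(out.shape)  # (5, 64)
```

Each word position can thus draw on both uniform scene context (grid tokens) and object-centric evidence (region tokens) in one attention step.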

The caption generator then generates captions autoregressively: given the sequence of words predicted up to time t-1, it predicts the next word at time t.
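The autoregressive loop above can be sketched as follows; the toy predict_next function is a hypothetical stand-in for the real Transformer decoder, which would score the vocabulary from the visual features and the word prefix:

```python
def generate_caption(predict_next, bos="<bos>", eos="<eos>", max_len=20):
    """Greedy autoregressive decoding: feed the words generated so far
    (up to time t-1) back in to predict the word at time t."""
    words = [bos]
    for _ in range(max_len):
        nxt = predict_next(words)
        if nxt == eos:
            break
        words.append(nxt)
    return words[1:]  # drop the <bos> token

# Toy "model": emits a fixed sentence, then <eos>.
sentence = ["a", "dog", "on", "a", "couch", "<eos>"]
toy = lambda prefix: sentence[len(prefix) - 1]
print(generate_caption(toy))  # ['a', 'dog', 'on', 'a', 'couch']
```

In practice beam search rather than greedy decoding is used at evaluation time, but the one-word-at-a-time structure is the same.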

Specifically, following standard practice in image captioning research, the model is pre-trained with a cross-entropy loss and fine-tuned by CIDEr-D optimization with a self-critical sequence training strategy.

That is, given a ground-truth sentence x*1:T, the model is trained to predict the next word x*t at each step t = 1, ..., T, which is equivalent to minimizing the following loss function with respect to the model parameters θ.
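As a minimal sketch, the teacher-forced cross-entropy objective can be computed as follows; the per-step probabilities stand in for a real model's softmax outputs on the ground-truth words:

```python
import math

def xent_loss(step_probs):
    """Teacher-forced cross-entropy: step_probs[t] is the probability the
    model assigned to the ground-truth word x*_t given the prefix
    x*_{1:t-1}; the loss is -sum_t log p(x*_t | x*_{1:t-1})."""
    return -sum(math.log(p) for p in step_probs)

# A model assigning probability 0.5 to each of 4 ground-truth words:
print(xent_loss([0.5, 0.5, 0.5, 0.5]))  # 4 * ln(2) ≈ 2.7726
```

Minimizing this loss pushes the model to assign high probability to every ground-truth word given its prefix.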

The model is then fine-tuned by CIDEr-D optimization, where, following existing studies, the CIDEr score is used as the reward and the mean of the rewards as the baseline. The loss in self-critical sequence training is thus expressed by the following equation.

where wi is the i-th sentence in the beam search, r(wi) is its reward, b is the reward baseline (the mean of the rewards), and k is the number of samples in the batch.
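Under the mean-baseline formulation described above, the SCST loss can be sketched as follows; the rewards would be CIDEr-D scores of the sampled captions, and the values here are purely illustrative:

```python
def scst_loss(rewards, logprobs):
    """Self-critical sequence training with a mean-reward baseline:
    L = -(1/k) * sum_i (r(w_i) - b) * log p(w_i),  b = mean of rewards."""
    k = len(rewards)
    b = sum(rewards) / k
    return -sum((r - b) * lp for r, lp in zip(rewards, logprobs)) / k

# Two sampled captions: the one scoring above the baseline (1.2 > 1.0)
# has its log-probability pushed up, the other pushed down.
print(round(scst_loss([1.2, 0.8], [-2.0, -3.0]), 6))  # -0.1
```

Because the baseline is the batch mean, captions that beat their peers receive a positive advantage and are reinforced, without needing a separately trained value function.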


In this paper, online evaluation was performed on the COCO dataset (123,287 images, each annotated with five captions), the standard benchmark for image captioning research. In addition, to validate the effectiveness of the method on other image captioning datasets, the model's performance is also evaluated on the nocaps and ArtEmis datasets.

The standard metrics BLEU@N, METEOR, ROUGE-L, CIDEr, and SPICE are used for evaluation.

Online evaluation with COCO dataset

In this experiment, a single model and an ensemble of six models were evaluated on the 40,000 test images of the COCO dataset, with the results shown in the table below.

As shown in the table, the proposed method achieves the best scores on all evaluation metrics.

Performance evaluation on the nocaps and ArtEmis datasets

In addition to the experiments described above, two further experiments were conducted: (a) evaluating the model under the same conditions as on the COCO dataset, and (b) evaluating the zero-shot inference performance of the model trained on COCO.

The results of (a) and (b) are shown in the table below.

The results show a significant performance improvement over the existing method in both experiments.

Qualitative Examples

Examples of captions generated by the proposed method (GRIT) and the existing method (M2 Transformer) for the input images of the COCO dataset are shown in the figure below.

It can be observed that, compared with the existing method, GRIT generates noticeably better captions in terms of both object detection and the description of relationships between objects.

The inaccurate captions produced by the existing method stem from a known problem of conventional image caption generation models: region features extracted by a pre-trained object detector suffer from false detections and a lack of contextual information. The results demonstrate that the proposed method eliminates these problems.


In this article, we have discussed GRIT (Grid- and Region-based Image captioning Transformer), a Transformer-based image captioning model that extracts richer visual information from an input image by integrating the region features and grid features extracted from it.

The experiments in the paper show that GRIT significantly outperforms existing methods in both inference speed and accuracy, successfully addressing the long-standing question of how best to integrate the two kinds of visual features.

We expect the field of image caption generation to develop further on the basis of this method, and we will be watching future developments closely.

If you are interested, details of the model architecture and further generated samples can be found in the paper.
