
Caption Generation With Diversity Reflecting Local Style Information Of The Image Is Now Possible!


Image Caption

3 main points
✔️ Proposed Style-SeqCVAE, a Variational Autoencoder (VAE)-based framework for encoding local style information of input images
✔️ Proposed an annotation extension method to obtain captions with various styles from the COCO dataset
✔️ Experiments with the Senticap and COCO datasets enable caption generation with a variety of styles

Diverse Image Captioning with Grounded Style
written by Franz Klein, Shweta Mahajan, Stefan Roth
(Submitted on 3 May 2022)
Comments: 
GCPR 2021
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

code:  
 

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, the development of multimodal datasets integrating vision and language has led to the proposal of various models for image captioning (the task of generating a natural-language description of the scene shown in an image).

However, the main dataset available for image captioning, the COCO dataset, consists of multiple captions per image written by different annotators, yet captioning frameworks trained on it typically generate only a single, deterministic caption per image.

To address this problem, a variety of methods have been proposed that generate multiple captions for a single image. However, these approaches take little account of the styles of the input image and text (variations in sentence structure and changes in linguistic style arising from attention to multiple local features), and can only generate a single caption associated with a single emotion category extracted from the image.

The Style-SeqCVAE presented in this paper encodes style information into the latent space of a Variational Autoencoder and structures that latent space sequentially according to the local style information of the input image, enabling the generation of captions in a variety of styles from a single input image.

Problems with existing image caption datasets

One of the main problems with existing image caption datasets is that the annotated caption text may not correspond to the actual content shown in the image.

As an example, in the image from the Senticap dataset below, the caption text incorrectly refers to the man on the left as a dead man; this kind of incorrect image-caption correspondence can degrade the quality of the generated captions.

Another problem is the imbalance in the frequency of positive and negative captions: the Senticap dataset contains 842 positive adjective-noun pairs (ANPs), built from 98 adjectives and 270 nouns, but only 468 negative ANPs, built from 117 adjectives and 173 nouns.

To remedy these problems, this paper proposes an extension method for the COCO and Senticap datasets.

Extension of the COCO dataset

In this paper, to enable the generation of captions in various styles, the COCO dataset was extended as follows.

  1. To address the lack of style-aware caption annotations in the dataset, combine COCO captions, which focus on scene composition, with style-expressing adjectives from COCO Attributes
  2. Eliminate 98 categories (such as "cooked") that are not relevant for style-aware caption generation
  3. Define sets of synonyms among the remaining categories to increase diversity (sketched in the snippet after this list)
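As a rough illustration of steps 2 and 3, the category filtering and synonym sets could look something like the minimal sketch below. The category names and synonym groupings here are placeholders for illustration, not the paper's actual lists.

```python
# Minimal sketch of steps 2-3: filter out attribute categories that are not
# style-relevant, then expand the remaining style adjectives via synonym sets.
# All entries below are illustrative examples, not the paper's exact data.

IRRELEVANT_ATTRIBUTES = {"cooked", "sliced", "open", "closed"}  # example entries only

SYNONYM_SETS = {
    "happy":     ["happy", "cheerful", "joyful"],
    "dirty":     ["dirty", "filthy", "grimy"],
    "beautiful": ["beautiful", "lovely", "pretty"],
}

def style_adjectives(attribute_labels):
    """Keep only style-relevant attributes and expand each one to its synonym set."""
    kept = [a for a in attribute_labels if a not in IRRELEVANT_ATTRIBUTES]
    expanded = []
    for a in kept:
        expanded.extend(SYNONYM_SETS.get(a, [a]))
    return expanded

print(style_adjectives(["cooked", "happy"]))  # -> ['happy', 'cheerful', 'joyful']
```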

In addition to this, captions that account for the styles appearing in the image were created by following these steps.

  1. For each object category in the COCO dataset, define a set of interchangeable nouns with corresponding captions
  2. Given an input image, its associated objects and attribute labels, and a ground-truth caption, find nouns in the caption that also appear in the object-category sets defined above and insert adjectives sampled from the attribute annotations before those nouns (see the sketch after this list)
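The adjective-insertion step (step 2) can be sketched roughly as follows. The interchangeable-noun sets, the adjective pool, and the simple word-level matching are illustrative simplifications, not the paper's exact procedure.

```python
import random

# Sketch of step 2: prepend a sampled style adjective to caption nouns that
# match one of the object categories annotated for the image.
# INTERCHANGEABLE_NOUNS and the adjective pool are illustrative placeholders.

INTERCHANGEABLE_NOUNS = {
    "person": {"person", "man", "woman", "guy", "lady"},
    "dog":    {"dog", "puppy"},
}

def insert_style_adjective(caption, image_objects, adjective_pool):
    """For each caption word naming an annotated object, prepend a sampled adjective."""
    out = []
    for word in caption.split():
        matches = any(
            word in INTERCHANGEABLE_NOUNS.get(obj, {obj}) for obj in image_objects
        )
        if matches and adjective_pool:
            out.append(random.choice(adjective_pool))
        out.append(word)
    return " ".join(out)

caption = "a man walking a dog in the park"
print(insert_style_adjective(caption, ["person", "dog"], ["happy", "cheerful"]))
# e.g. -> "a cheerful man walking a happy dog in the park"
```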

The dataset created by these methods looks like the following, confirming that the captions have been extended to take the style of the image into account.

Overview of Style-SeqCVAE

To obtain styled captions grounded in the image, the method first extracts features of the objects in the input image and then uses these features in Style-SeqCVAE, whose structured latent space encodes the image's local style information.

The purpose of Style-SeqCVAE is to generate captions that reflect the various style information contained in the image, and the overall model is shown below.

Given an input image I and a caption sequence x = (x1, ..., xT), visual features {v1, ..., vK} of K regions in the image are extracted with Faster R-CNN, and the averaged image features are input to the attention LSTM, as shown in the figure.

In this work, we also propose to further encode the region-level style information into c_t^(I) and update it at each time step using the attention weights α_t.

This is based on the assumption that image style can differ significantly between regions. To account for this, the VAE is modeled with an explicit latent space structure using LSTM-based language encoders and decoders (the yellow areas in the full model figure), where h_t^attention, h_t^encoder, and h_t^decoder denote the respective LSTM hidden vectors at time step t.
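To make the data flow concrete, here is a minimal PyTorch-style sketch of a single decoding step under these assumptions: precomputed Faster R-CNN region features, a soft-attention module producing α_t, and a per-time-step latent z_t sampled by reparameterization. Layer sizes, the exact form of the style code c_t^(I), the handling of the previous word embedding, and how z_t enters the decoder are guesses for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleSeqCVAEStep(nn.Module):
    """Illustrative single time step: attention LSTM -> alpha_t -> style code ->
    encoder LSTM -> latent z_t -> decoder LSTM -> next-word logits."""

    def __init__(self, feat_dim=2048, hid=512, z_dim=128, vocab=10000):
        super().__init__()
        self.attn_lstm = nn.LSTMCell(feat_dim + hid, hid)        # attention LSTM
        self.attn_score = nn.Linear(feat_dim + hid, 1)           # scores for alpha_t
        self.enc_lstm = nn.LSTMCell(hid + feat_dim, hid)         # language encoder
        self.dec_lstm = nn.LSTMCell(hid + feat_dim + z_dim, hid) # language decoder
        self.to_mu = nn.Linear(hid, z_dim)
        self.to_logvar = nn.Linear(hid, z_dim)
        self.to_vocab = nn.Linear(hid, vocab)

    def forward(self, v, states):
        # v: (B, K, feat_dim) region features from Faster R-CNN
        (h_attn, c_attn), (h_enc, c_enc), (h_dec, c_dec) = states

        # attention LSTM sees the mean-pooled image feature and the decoder state
        v_mean = v.mean(dim=1)
        h_attn, c_attn = self.attn_lstm(torch.cat([v_mean, h_dec], dim=-1), (h_attn, c_attn))

        # attention weights alpha_t over the K regions -> attended style code c_t
        h_rep = h_attn.unsqueeze(1).expand(-1, v.size(1), -1)
        alpha_t = F.softmax(self.attn_score(torch.cat([v, h_rep], dim=-1)), dim=1)
        c_t = (alpha_t * v).sum(dim=1)

        # encoder LSTM parameterizes the per-step latent; sample z_t by reparameterization
        h_enc, c_enc = self.enc_lstm(torch.cat([h_attn, c_t], dim=-1), (h_enc, c_enc))
        mu, logvar = self.to_mu(h_enc), self.to_logvar(h_enc)
        z_t = mu + torch.randn_like(mu) * (0.5 * logvar).exp()

        # decoder LSTM predicts the distribution over the next word
        h_dec, c_dec = self.dec_lstm(torch.cat([h_attn, c_t, z_t], dim=-1), (h_dec, c_dec))
        logits = self.to_vocab(h_dec)
        return logits, ((h_attn, c_attn), (h_enc, c_enc), (h_dec, c_dec))
```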

Experiments

To evaluate this approach for generating captions in a variety of image-grounded styles, experiments were conducted on the Senticap dataset and the COCO dataset extended in this paper, using BLEU (B), CIDEr (C), ROUGE (R), and METEOR (M) as evaluation metrics.
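For reference, the snippet below shows the flavour of n-gram-based scoring using NLTK's BLEU implementation; the paper's scores would come from the standard COCO caption evaluation tools, and the example sentences here are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Toy example of BLEU scoring: one generated caption against two reference captions.
references = [
    "a happy man walking a dog in the park".split(),
    "a cheerful man walks his dog outside".split(),
]
candidate = "a happy man walks a dog in the park".split()

# Default weights give BLEU-4; smoothing avoids zero scores on short sentences.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {score:.3f}")
```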

Evaluation of the Senticap dataset

Since the Senticap dataset consists of positive and negative captions for images, previous studies have generated positive and negative captions for a given image based on a style indicator; this experiment does the same based on the latent space structured by Style-SeqCVAE.

The results are shown in the table below (n is the number of captions generated per input image).

From the table, when only one caption is generated per image (n=1), the proposed approach scores similarly to existing studies, but when 10 captions are generated per image (n=10), it scores better than the existing studies.

This suggests that, unlike our approach, existing methods do not encode as many style variations for a given image, and that our data extension technique allows appropriate style-related adjectives to be inserted in the right places in the captions.

An example of the generated caption is shown in the figure below.

Thus, we can confirm that our method can generate captions with various styles that accurately reflect the positive and negative emotions in the image.

Summary

How was it? In this article, we have described a paper that proposed Style-SeqCVAE, a Variational Autoencoder (VAE)-based framework for encoding local style information in input images.

Compared with existing research, this method enables us to generate more human-like captions that reflect various features of input images, and we are very much looking forward to future progress.

If you are interested, details of the model architecture and generated samples can be found in the paper.
