Now There's A Model That Generates Emotions And Descriptions Recalled From Real-world Images!
3 main points
✔️ Proposed Affective Explanation Captioning (AEC), a task that generates emotions and explanations recalled from real-world images
✔️ Created Affection, a large dataset in which 6283 annotators recorded the emotions they felt, with explanations, for 85007 real-world images
✔️ Turing test showed that about 40% of evaluators could not discriminate between a neural speaker created using Affection and a human
Affection: Learning Affective Explanations for Real-World Visual Data
written by Panos Achlioptas, Maks Ovsjanikov, Leonidas Guibas, Sergey Tulyakov
(Submitted on 4 Oct 2022)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
The images used in this article are from the paper, the introductory slides, or were created based on them.
In recent years, inspired by research on the prediction of emotions evoked from visual art and the generation of explanatory text for these emotions, there has been a growing body of research linking the emotional responses evoked by images with explanatory text in natural language.
Although models have been developed in the past to classify emotions from images, learning emotions through natural language enriches the nuances of the resulting emotional analysis and enables the generation of human-like descriptions.
In this paper, we propose a task (Affective Explanation Captioning, AEC) that extends such emotion prediction and explanation generation to real-world images, rather than limiting it to visual art as in existing research. We also present Affection, a large-scale dataset for this task, and neural speakers built using this dataset.
The Affection (Affective Explanations) dataset was constructed from images in the existing publicly available datasets MS-COCO, Emotional-Machines, Flickr30k Entities, and Visual Genome, together with images used in an existing study by Quanzeng et al.
Specifically, 6283 annotators with various opinions, personalities, and preferences were shown 85007 real-world images from the five datasets and asked to choose, for each image, one of four positive emotions (amusement, awe, contentment, and excitement), one of four negative emotions (anger, disgust, fear, and sadness), or "something else".
As a result, across all images, 71.3% of the annotated emotions were positive and 21.1% were negative, as shown in the graph below.
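The label scheme described above can be written down as a small data structure. The following is a minimal sketch, not the authors' code: the constant and function names are hypothetical, and the polarity grouping simply mirrors the four positive and four negative emotions listed in the text.

```python
from collections import Counter

# The nine labels offered to annotators, grouped as in the text.
POSITIVE = ("amusement", "awe", "contentment", "excitement")
NEGATIVE = ("anger", "disgust", "fear", "sadness")
OTHER = "something else"
EMOTIONS = POSITIVE + NEGATIVE + (OTHER,)  # nine classes in total


def polarity(emotion):
    """Map one of the nine labels to 'positive', 'negative', or 'other'."""
    if emotion in POSITIVE:
        return "positive"
    if emotion in NEGATIVE:
        return "negative"
    if emotion == OTHER:
        return "other"
    raise ValueError(f"unknown emotion label: {emotion!r}")


def polarity_shares(labels):
    """Return the fraction of annotations in each polarity group."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(polarity(label) for label in labels)
    total = sum(counts.values())
    return {g: counts[g] / total for g in ("positive", "negative", "other")}
```

Applied to the full set of annotations, a function like `polarity_shares` would produce the kind of positive/negative breakdown reported for the dataset.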
Then, by adding text describing the emotion in detail, the image/description pairs shown in the figure below were collected.
The above figure shows pairs of images and explanatory text for the word "bird". The annotations are characterized by the fact that they include inferences beyond what can be recognized from the image.
In addition, the table below shows that Affection has a richer vocabulary and a more complex corpus than existing datasets.
Affective Explanation Captioning
To perform the task of generating the emotions and descriptions to be recalled from real-world images, the following two models must be combined.
- A model that, given a real-world image and its description, predicts the distribution of emotions recalled from it
- A model that, given a real-world image, generates a descriptive sentence that includes the emotion evoked from the image.
Each of these will be explained.
Basic Classification Tasks
In this paper, we follow existing research and denote models that predict emotion from input text as Cemotion|text, and models that predict emotion from input images as Cemotion|image.
For Cemotion|text, an LSTM-based text classifier was trained from scratch with the standard cross-entropy loss to predict the nine emotion classes annotated in Affection.
Cemotion|image uses a ResNet-101 pre-trained on ImageNet to predict the emotion distribution for an input image. It was fine-tuned using, as the loss, the KL-divergence between the emotion distribution annotated in Affection and the predicted distribution.
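To make the KL-divergence loss concrete, here is a minimal sketch in plain Python. It assumes that the target distribution is the normalized count of the labels annotators gave to one image and that the loss is KL(target || predicted); the article does not state the direction, and all function names are hypothetical.

```python
import math
from collections import Counter

EMOTIONS = ("amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness", "something else")


def empirical_distribution(labels, classes=EMOTIONS):
    """Turn the labels annotators gave one image into a distribution."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def softmax(logits):
    """Convert raw classifier scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); classes with zero target mass contribute 0."""
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, predicted) if t > 0)


# Example: three annotators chose "awe", one chose "fear".
target = empirical_distribution(["awe", "awe", "awe", "fear"])
predicted = softmax([0.0] * len(EMOTIONS))  # an untrained, uniform guess
loss = kl_divergence(target, predicted)
```

The loss is zero only when the predicted distribution matches the annotated one, so minimizing it pushes the classifier toward the spread of human responses rather than a single label.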
Neural Listeners and Speakers
As the base of the generative model in this paper, we use SAT (Show-Attend-and-Tell), a simple and high-performing model widely used in existing research. Specifically, at each time step, the model learns to attend to the image information encoded by the ResNet-101 of Cemotion|image and predicts the next token by combining the current input token with the hidden state of the LSTM.
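The attention step can be illustrated with a toy example. The sketch below is an assumption-laden simplification: it scores each image region with a plain dot product against the hidden state, swapped in for the learned scoring network that SAT actually trains, and it uses tiny hand-written vectors in place of ResNet-101 features. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into attention weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, hidden_state):
    """One soft-attention step over image regions.

    region_features: list of feature vectors, one per image region.
    hidden_state: current decoder state, same length as each feature vector.
    Returns (weights, context), where context is the weighted feature sum.
    """
    # Assumption: dot-product scoring stands in for SAT's learned scorer.
    scores = [sum(f * h for f, h in zip(region, hidden_state))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [sum(w * region[d] for w, region in zip(weights, region_features))
               for d in range(dim)]
    return weights, context
```

In a full speaker, the context vector would be fed to the LSTM together with the current input token to predict the next word; that part is omitted here.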
This allows us to make appropriate emotional predictions and generate explanatory text for a given image, as shown in the figure below.
In addition, it is noteworthy that by learning emotions through language, the prediction is more nuanced than existing models that classify emotions from images alone.
Taking the second dog image from the left in the bottom row of the figure below as an example, we can see that, unlike the existing model, the prediction is more human-like: dog growling and showing its teeth → possibility of hurting someone → Fear.
In addition, as in existing studies, it is possible to control the emotion distribution obtained from the Cemotion|image to generate explanatory text containing arbitrary emotions.
Emotional Turing test
In this paper, we conducted a Turing test to evaluate how well the created neural speaker can generate human-like sentences.
Specifically, we evaluated four models: a model using the basic SAT (Default), a model using ResNet-101 with added emotional information (Emo-Grounded), a model that uses CLIP to rank the generated sentences and output the most appropriate one (Default-Pragmatic), and a model that combines both methods (Emo-Grounded Pragmatic).
The test procedure is as follows:
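The pragmatic variants boil down to "generate several candidates, score each against the image, keep the best". A minimal sketch of that reranking step follows. The scoring function is passed in as a parameter because the article does not describe how CLIP is called; `toy_score` is a purely hypothetical stand-in based on word overlap and is not CLIP.

```python
def pick_pragmatic_caption(image, candidates, score_fn):
    """Return the candidate caption that score_fn rates highest for image."""
    if not candidates:
        raise ValueError("need at least one candidate caption")
    return max(candidates, key=lambda caption: score_fn(image, caption))


def toy_score(image_tags, caption):
    """Hypothetical stand-in for an image-text similarity model:
    counts how many image tags appear as words in the caption."""
    words = set(caption.lower().split())
    return len(words & set(image_tags))


best = pick_pragmatic_caption(
    {"dog", "teeth"},
    ["a nice sunny day", "the dog bares its teeth and scares me"],
    toy_score,
)
```

In practice, `score_fn` would wrap a real image-text similarity model, and the candidates would come from sampling the neural speaker several times for the same image.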
- Randomly select 500 test images and attach one human-written description to each image
- Pair each image and its human-written description with a description generated by the neural speaker
- For each of these samples, the annotator judges whether the descriptions were written by a human or by the neural speaker
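The procedure above can be sketched as a small script. This is only an illustration under stated assumptions: the trial layout, the field names, and the verdict labels such as `"both_human"` are hypothetical and are not taken from the paper's evaluation interface.

```python
import random
from collections import Counter


def build_trials(samples, rng):
    """samples: list of (image_id, human_text, machine_text) tuples.

    Each trial shows both descriptions in random order so that position
    does not reveal which one the neural speaker wrote.
    """
    trials = []
    for image_id, human_text, machine_text in samples:
        pair = [("human", human_text), ("machine", machine_text)]
        rng.shuffle(pair)
        trials.append({
            "image_id": image_id,
            "shown": [text for _, text in pair],
            "truth": [source for source, _ in pair],
        })
    return trials


def verdict_shares(verdicts):
    """Fraction of responses for each verdict label."""
    counts = Counter(verdicts)
    total = len(verdicts)
    return {verdict: count / total for verdict, count in counts.items()}


trials = build_trials(
    [("img-1", "The bird looks free and calm.", "A bird flying makes me happy.")],
    random.Random(0),
)
shares = verdict_shares(
    ["both_human", "both_human", "one_machine", "one_machine", "one_machine"]
)
```

The share of "both human" verdicts is the quantity of interest: the higher it is, the more often the neural speaker's description passed as human-written.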
The results of the Turing test are shown in the figure below.
As the figure shows, for every model more than 40% of the respondents (41.1% to 46.2%) judged that both descriptions had been written by humans. This demonstrates that neural speakers trained on the Affection dataset can generate explanatory text comparable to that of humans.
How was it? In this article, we described a paper that extended the existing research on sentiment analysis and description generation for visual art to real-world images and created a large dataset for this task, Affection, and a neural speaker using this dataset.
We look forward to future progress in this research, as it will lead not only to the generation of descriptive text from images but also to a more comprehensive understanding of image content and how its elements affect human emotions.
For those who are interested, the details of the dataset and the model architectures can be found in the original paper.