
ArtEmis: Explaining Art With AI



3 main points.
✔️ Built a large dataset, ArtEmis, in which visual artworks such as paintings are labeled with the emotions viewers felt toward them and with explanations of those emotions
✔️ Trained models on ArtEmis that predict emotions from images and from sentences
✔️ Trained neural speakers on ArtEmis that generate sentences describing images, including metaphorical expressions

ArtEmis: Affective Language for Visual Art
written by Panos Achlioptas, Maks Ovsjanikov, Kilichbek Haydarov, Mohamed Elhoseiny, Leonidas Guibas
(Submitted on 19 Jan 2021)
Comments: Accepted to arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)

Introduction

Emotions are among the most universal human traits, and they are shared through language. In this paper, we specifically aimed to learn the emotions viewers feel toward visual art, together with the sentences that explain those emotions. Visual art was chosen as a way to understand human emotional responses to images because an artwork is created by its author with the aim of acting on the viewer's mind, and it contains abstract expressions that call for complex explanations. To enable this study, we created ArtEmis, a large dataset of images labeled with the emotions they evoke and with descriptions. We then used ArtEmis to build classifiers that predict emotions from images and sentences, and neural speakers that generate emotion-based descriptions.


The images in ArtEmis come from the public WikiArt collection. For each image, at least five annotators were asked to describe how the image made them feel and why. They chose from four positive emotions ('Amusement', 'Awe', 'Contentment', 'Excitement'), four negative emotions ('Anger', 'Disgust', 'Fear', 'Sadness'), and a ninth option, 'Something Else'. The figure below shows labeled samples. As you can see, the descriptions contain many abstract expressions. In total, ArtEmis contains 439,121 descriptive sentences.


Compared to traditional datasets such as COCO, ArtEmis sentences are not only longer but also draw on a larger vocabulary, making it a more expressive dataset. It also contains a very large number of emotional expressions. The figure below shows histograms of the 'concreteness', 'subjectivity', and 'sentiment' scores of the sentences, analyzed with tools such as the sentiment analyzer VADER and compared against COCO. They show that ArtEmis is more abstract, subjective, and emotional.

The figure below shows the distribution of emotion categories in ArtEmis. Overall, positive emotions were selected more often than negative ones. Interestingly, however, for 61% of the images at least one positive emotion and at least one negative emotion were selected. There was also a notable degree of agreement among the annotators: for 45.6% of the images, the majority of annotators chose the same emotion.
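The two statistics above can be computed directly from per-image emotion labels. The following is a minimal sketch, not the authors' code: the annotation data and function names are hypothetical, and it assumes "majority" means that more than half of the annotators chose the same emotion.

```python
from collections import Counter

POSITIVE = {"amusement", "awe", "contentment", "excitement"}
NEGATIVE = {"anger", "disgust", "fear", "sadness"}


def has_mixed_valence(labels):
    """True if at least one positive and at least one negative emotion was chosen."""
    chosen = set(labels)
    return bool(chosen & POSITIVE) and bool(chosen & NEGATIVE)


def has_majority_emotion(labels):
    """True if one emotion was chosen by more than half of the annotators (assumed definition)."""
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count > len(labels) / 2


# Hypothetical annotations: image id -> emotions chosen by five annotators.
annotations = {
    "img_a": ["awe", "awe", "awe", "fear", "contentment"],
    "img_b": ["sadness", "fear", "awe", "contentment", "amusement"],
    "img_c": ["contentment", "contentment", "awe", "awe", "excitement"],
}

# Share of images with both a positive and a negative emotion.
mixed_share = sum(has_mixed_valence(v) for v in annotations.values()) / len(annotations)
# Share of images on which a majority of annotators agree.
majority_share = sum(has_majority_emotion(v) for v in annotations.values()) / len(annotations)

print(f"mixed valence: {mixed_share:.1%}, majority agreement: {majority_share:.1%}")
```

In this toy data, an image can count toward both statistics at once, which is also why the 61% and 45.6% figures in the article are not mutually exclusive.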


To assess the validity of the ArtEmis descriptions, we showed subjects who had not taken part in the labeling a random image with its description and asked whether it was a valid description of that image; 97.5% of them said yes. In addition, when subjects were presented with several images and descriptions in random order and asked to match them, 94.7% of the matches were correct. These results show that the images were labeled very well despite the high level of abstraction.


Emotional predictor

For emotion prediction, we considered two problems: predicting the emotion from a sentence, and predicting the distribution of emotions from an image. The former model, $C_{emotion|text}$, was built in two ways: as an LSTM text classifier trained with cross-entropy loss, and by fine-tuning a pretrained BERT model. The latter model, $C_{emotion|image}$, was built by fine-tuning a pretrained ResNet encoder, using the KL divergence between its output and the actual distribution of annotated emotions as the loss function.
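To make the image-side loss concrete, here is a minimal sketch in plain Python of a KL divergence between the annotators' emotion distribution and a predicted distribution. The labels and logits are hypothetical, and the direction KL(target || predicted) is an assumption; the article only says that the KL divergence between the output and the actual distribution is used.

```python
import math

EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness", "something else"]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def empirical_distribution(labels, classes):
    """Fraction of annotators that chose each class."""
    return [labels.count(c) / len(labels) for c in classes]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); classes with target probability 0 contribute 0."""
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, predicted) if t > 0)


# Hypothetical labels from five annotators for one image.
labels = ["awe", "awe", "contentment", "awe", "sadness"]
target = empirical_distribution(labels, EMOTIONS)

# Hypothetical raw scores that an image classifier might output.
logits = [0.1, 2.0, 1.0, 0.3, -1.0, -1.5, -0.5, 0.4, -0.8]
predicted = softmax(logits)

loss = kl_divergence(target, predicted)
print(f"KL loss: {loss:.4f}")
```

The loss is zero only when the predicted distribution matches the annotators' distribution exactly, so it rewards the model for reproducing disagreement among viewers rather than a single label.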

Neural Speaker

First, for comparison, we created an Adjective-Noun Pairs-based neural speaker (ANP), which is trained on the COCO dataset rather than ArtEmis and learns adjective-noun pairs.

Basic speaker

To train on ArtEmis, we used two models: Show-Attend-Tell (SAT), which combines an image encoder with an LSTM equipped with an attention mechanism, and Meshed-Memory Transformer ($M^2$), which replaces the recurrent structure with a transformer and uses bounding boxes computed separately by a CNN. In addition, we created a Nearest-Neighbor (NN) baseline, which involves no training of its own: for each test image, it retrieves the nearest neighbor from the training data and outputs the associated description.
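The Nearest-Neighbor baseline is simple enough to sketch in a few lines. The example below is illustrative only: the feature vectors, captions, and function names are hypothetical, and cosine similarity is an assumed choice, since the article does not say which distance is used. In practice the features would come from a pretrained image encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbor_caption(query_features, training_set):
    """Return the caption of the training item most similar to the query image."""
    best = max(training_set,
               key=lambda item: cosine_similarity(query_features, item["features"]))
    return best["caption"]


# Hypothetical training items: image features paired with a description.
training_set = [
    {"features": [1.0, 0.0, 0.0], "caption": "the calm water makes me feel at peace"},
    {"features": [0.0, 1.0, 0.0], "caption": "the dark sky looks like a coming storm"},
    {"features": [0.0, 0.0, 1.0], "caption": "the bright colors feel like a celebration"},
]

# Hypothetical features of a test image.
test_features = [0.9, 0.1, 0.0]
caption = nearest_neighbor_caption(test_features, training_set)
print(caption)
```

Because nothing is learned, this baseline can only repeat descriptions that already exist in the training data.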

Grounded speaker

Furthermore, we trained a variant of SAT that is additionally conditioned on an emotion label. By adding features extracted from the emotion label through a fully connected layer, we created a model that can generate sentences for an arbitrary specified emotion.
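One way to picture this kind of emotion grounding is sketched below. It is a hedged illustration, not the paper's implementation: it assumes the emotion is given as a one-hot vector, passed through a fully connected layer, and concatenated to the decoder's input at each step. The layer size, the random weights, and all function names are hypothetical; in a real model the weights would be learned together with the speaker.

```python
import random

EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness", "something else"]


def one_hot(emotion):
    """Encode an emotion label as a one-hot vector over EMOTIONS."""
    return [1.0 if e == emotion else 0.0 for e in EMOTIONS]


def fully_connected(x, weights, bias):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def grounded_decoder_input(word_embedding, emotion, weights, bias):
    """Concatenate the word embedding with the emotion features (assumed design)."""
    emotion_features = fully_connected(one_hot(emotion), weights, bias)
    return word_embedding + emotion_features


# Hypothetical, randomly initialized layer mapping 9 emotions to 4 features.
rng = random.Random(0)
weights = [[rng.uniform(-1.0, 1.0) for _ in EMOTIONS] for _ in range(4)]
bias = [0.0] * 4

# Hypothetical embedding of the current word fed to the decoder.
word_embedding = [0.2, -0.1, 0.4]

awe_input = grounded_decoder_input(word_embedding, "awe", weights, bias)
fear_input = grounded_decoder_input(word_embedding, "fear", weights, bias)
print(len(awe_input), awe_input != fear_input)
```

The same image and word therefore produce a different decoder input for each requested emotion, which is what lets the speaker steer its description.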

Evaluation method

Three families of metrics, 'BLEU-1' to 'BLEU-4', 'ROUGE-L', and 'METEOR', were used to evaluate the neural speakers quantitatively. These measure linguistic similarity: the higher the value, the better the agreement with the ground truth. Other metrics were also evaluated, including the length of the terms shared between generated sentences, the percentage of sentences containing metaphorical expressions, and the percentage of generated sentences whose predicted emotion matches the ground truth ('Emo-Align'). Finally, to examine the difference between machine and human emotional language, we ran an experiment modeled on the Turing test, in which humans judged whether each sentence was written by a human or generated by a machine.
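Two of these metrics are easy to illustrate. The sketch below implements a simplified single-reference BLEU-1 (clipped unigram precision times the brevity penalty) and an Emo-Align score as the fraction of matching emotions. Both are simplified stand-ins under stated assumptions: real evaluations typically use several references and a standard toolkit, and the emotion of each generated sentence would be predicted by a text classifier rather than listed by hand. The sentences and labels are hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Simplified BLEU-1 for one reference: clipped unigram precision x brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate word is credited at most as often as it appears in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Candidates shorter than the reference are penalized.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return precision * brevity


def emo_align(predicted_emotions, ground_truth_emotions):
    """Fraction of generated sentences whose predicted emotion matches the ground truth."""
    matches = sum(p == g for p, g in zip(predicted_emotions, ground_truth_emotions))
    return matches / len(ground_truth_emotions)


# Hypothetical generated sentence and human reference.
score = bleu1("the dark sky looks sad", "the dark sky makes me sad")

# Hypothetical emotions: predicted for generated sentences vs. ground truth.
alignment = emo_align(["awe", "fear", "awe", "sadness"],
                      ["awe", "fear", "contentment", "sadness"])

print(f"BLEU-1: {score:.3f}, Emo-Align: {alignment:.2f}")
```

Word-overlap metrics like BLEU reward reusing the reference's wording, which helps explain why they tend to be low on a dataset where many different descriptions of the same image are equally valid.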


Emotional predictor

Accurately classifying the individual emotions is very difficult, because the emotions within the positive group, and likewise within the negative group, resemble one another. In contrast, the accuracy of the two-class classification into positive and negative was about 90%.
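The gap between fine-grained and two-class accuracy can be illustrated by collapsing each emotion to its valence before scoring. This is a minimal sketch with hypothetical predictions; it assumes that items whose true label is 'something else' are skipped in the two-class evaluation, which the article does not state.

```python
POSITIVE = {"amusement", "awe", "contentment", "excitement"}
NEGATIVE = {"anger", "disgust", "fear", "sadness"}


def valence(emotion):
    """Map an emotion to 'positive' or 'negative'; other labels have no valence."""
    if emotion in POSITIVE:
        return "positive"
    if emotion in NEGATIVE:
        return "negative"
    return None


def accuracy(predicted, actual):
    """Fraction of exact matches between predicted and true emotions."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def binary_accuracy(predicted, actual):
    """Accuracy after collapsing emotions to positive/negative."""
    pairs = [(valence(p), valence(a)) for p, a in zip(predicted, actual)]
    # Assumption: skip items whose true label has no valence.
    pairs = [(p, a) for p, a in pairs if a is not None]
    return sum(p == a for p, a in pairs) / len(pairs)


# Hypothetical true and predicted emotions for five sentences.
actual = ["awe", "contentment", "fear", "sadness", "amusement"]
predicted = ["contentment", "awe", "sadness", "sadness", "excitement"]

fine_grained = accuracy(predicted, actual)
two_class = binary_accuracy(predicted, actual)
print(f"fine-grained: {fine_grained:.0%}, two-class: {two_class:.0%}")
```

In this toy example most mistakes confuse one positive emotion with another, so the fine-grained score is low while the two-class score is perfect.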

Neural Speaker

The table below shows the performance of each neural speaker. Linguistic similarity scores such as 'BLEU' are lower than those reported on traditional datasets such as COCO, owing to the high level of abstraction in ArtEmis. There is also a difference in performance between the models trained on ArtEmis (Basic, Grounded) and the baselines that were not (NN, ANP). Furthermore, 'Emo-Align' shows that sentences generated with a specified emotion score much higher than those generated without one.

As a qualitative evaluation, the following figure shows sentences generated for test images by the Grounded speaker. It can be seen that the model produces sophisticated expressions that reflect the specified emotion.

Turing test

Interestingly, 50.3% of the sentences generated by the Grounded speaker and 40% of the sentences generated by the Basic speaker were judged to be human-written.


Human perception and emotion are still underdeveloped areas in AI. To address this, in this paper we created ArtEmis, a dataset labeled with the emotions evoked by visual art and with sentences that explain them. We then built emotion predictors and neural speakers that generate explanations. The results show that some of the generated sentences are indistinguishable from human-written ones, but the generated sentences are still far from human sentences in terms of diversity and creativity.
