Finally, An AI That Understands Sarcastic Dialogue And Can Generate Descriptions!

Natural Language Processing

3 main points
✔️ Proposes a new task, SED (Sarcasm Explanation in Dialogue), which aims to generate explanations for sarcastic utterances and reveal the intent behind the sarcasm.
✔️ Extends the dataset of an existing sarcasm identification task with human annotations to create a new dataset, WITS (Why Is This Sarcastic).
✔️ Designs Modality Aware Fusion (MAF) as a benchmark for WITS, which explains sarcastic expressions in conversation through multimodal context-aware attention.

When did you become so smart, oh wise one?! Sarcasm Explanation in Multi-modal Multi-party Dialogues
written by Shivani Kumar, Atharva Kulkarni, Md Shad Akhtar, Tanmoy Chakraborty
(Submitted on 12 Mar 2022)
Comments: Accepted in ACL 2022.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Sarcasm is essential for smooth human conversation, whether it is used to express humor, criticism, or surprise, or to highlight a discrepancy between expectation and reality. It is therefore very important for dialogue agents to understand sarcastic utterances and to respond to them appropriately.

There has been research on identifying sarcastic expressions from textual and multimodal information in the domain of dialogue systems. However, for dialogue agents to emulate more human-like behavior, they need to be able not only to identify sarcasm but also to understand sarcastic expressions in their entirety.

The three main contributions of the paper are as follows.

  • It proposes a new task, SED (Sarcasm Explanation in Dialogue), which aims to generate explanations for sarcastic dialogues and reveal the intention behind the sarcasm.
  • It creates a new dataset, WITS (Why Is This Sarcastic), by extending the dataset of an existing sarcasm identification task and annotating it by hand.
  • As a benchmark for WITS, it designs Modality Aware Fusion (MAF), which enables the explanation of sarcastic expressions in conversation through multimodal context-aware attention.

Let's look at them in order.

Overview of SED (Sarcasm Explanation in Dialogue)

The figure below shows a sample of SED, the task proposed in this paper to generate explanatory text for sarcastic lines.

The conversation here consists of four utterances by two characters, u1, u2, u3, and u4, where the last utterance, u4, is the one containing the sarcastic expression. (The dataset is in Hindi and the blue text is the English translation.)

In SED, the task is to generate explanatory text for sarcastic utterances by aggregating the conversation history, multimodal information such as intonation and facial expressions, and speaker information, as shown in the Sarcasm Explanation in the figure.
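To make the input side of the task concrete, here is a minimal sketch of how a dialogue with speaker information might be flattened into one text sequence for a text-to-text model. The `Utterance` class, the `build_input` helper, the separator string, the speaker order, and the placeholder utterances are all assumptions made for illustration and are not taken from the paper. Audio and visual features are left out entirely.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: str  # who is talking
    text: str     # what they say (placeholder strings in this sketch)


def build_input(dialogue: List[Utterance], sep: str = " </s> ") -> str:
    """Join the context and the final (sarcastic) utterance into one string.

    The "speaker: text" layout and the separator are illustrative choices.
    """
    if not dialogue:
        raise ValueError("dialogue must contain at least one utterance")
    return sep.join(f"{u.speaker}: {u.text}" for u in dialogue)


# Four utterances by two speakers, mirroring the u1..u4 layout of the example.
# The speaker order and the utterance texts are hypothetical placeholders.
dialogue = [
    Utterance("Maya", "<u1>"),
    Utterance("Indu", "<u2>"),
    Utterance("Maya", "<u3>"),
    Utterance("Indu", "<u4: sarcastic utterance>"),
]
model_input = build_input(dialogue)
```

The target side of each training pair would then be the human-written explanation for the last utterance.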

Each explanation contains the following four attributes.

  1. Sarcasm Source: the person who is being sarcastic in the dialogue
  2. Sarcasm Target: the person or thing at which the sarcasm is directed
  3. Action word: the verb used to describe how the sarcasm is delivered (mocks, insults, etc.)
  4. Description: a description of the scene that helps the reader understand the sarcasm

In the example above, "Indu implies that Maya is not looking good.", Indu is the Sarcasm Source, Maya is the Sarcasm Target, "implies" is the Action word, and "is not looking good" is the Description.
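The four attributes can be pictured as a small record attached to each explanation. The following is only a sketch: the `SarcasmExplanation` class and the `covers_all_attributes` helper are hypothetical names introduced here, not part of the paper or of the dataset's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SarcasmExplanation:
    source: str       # Sarcasm Source
    target: str       # Sarcasm Target
    action_word: str  # Action word
    description: str  # Description
    text: str         # the full explanation sentence


def covers_all_attributes(expl: SarcasmExplanation) -> bool:
    """Return True if every attribute appears verbatim in the explanation text."""
    parts = (expl.source, expl.target, expl.action_word, expl.description)
    return all(part in expl.text for part in parts)


# The example sentence from the article, split into its four attributes.
example = SarcasmExplanation(
    source="Indu",
    target="Maya",
    action_word="implies",
    description="is not looking good",
    text="Indu implies that Maya is not looking good.",
)
print(covers_all_attributes(example))  # prints True
```

A check like this could serve as a simple sanity test when writing annotations, although the article does not say whether the annotators used one.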

WITS (Why Is This Sarcastic) Overview

Next, we describe the new dataset, WITS.

Until now, data drawn from sitcoms (situational comedies), which depict human behaviors and mannerisms in daily life, has been used in the task of identifying sarcastic expressions.

However, such data is not suitable as it stands for SED, the task proposed in this paper, which generates explanatory text for utterances containing sarcastic expressions. The authors therefore created a new dataset named WITS (Why Is This Sarcastic).

The details of WITS are as follows.

  • WITS extends an existing dataset, MASAC (Bedi et al., 2021), and augments it with explanatory text
    • MASAC is a multimodal Hindi and English dialogue dataset compiled from popular Indian TV shows
  • The original dataset contains 45 episodes of the TV series, and the authors added 10 more episodes along with their translations
    • The authors then manually selected the utterances containing sarcastic expressions from this extended dataset
  • The result is a dialogue dataset containing 2240 sarcastic expressions
    • Each of these is manually annotated with an explanation that interprets its sarcasm

Overview of MAF (Modality Aware Fusion)

The paper introduces MAF (Modality Aware Fusion), which consists of MCA2 (Multimodal Context Aware Attention) and GIF (Global Information Fusion), to smoothly integrate multimodal information into the BART architecture.

MCA2 incorporates multimodal information such as audio and visual cues into the textual representation of the dialogue, and GIF combines the resulting multimodal information into a single textual representation.

The figure below shows the architecture of the model in this paper.

The Multimodal Fusion Block in MAF uses MCA2 (Multimodal Context Aware Attention) to capture the audio-visual cues, and then uses the GIF (Global Information Fusion) block to fuse these cues with the text.

The most significant advantage of this module is that it can be easily inserted at multiple layers of BART and mBART, allowing for the integration of various multimodal interactions.
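This article does not give the fusion equations, so the following toy example only illustrates one common way to fuse a non-textual cue into a text representation: a gated residual fusion. It is a minimal sketch and not the authors' implementation. Whether GIF uses this exact form is an assumption, and the `gated_fusion` function, its per-dimension scalar weights, and the input values are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    text_vec: List[float],
    cue_vec: List[float],
    w_text: List[float],
    w_cue: List[float],
    bias: float = 0.0,
) -> List[float]:
    """Add a cue vector to a text vector, scaled by a per-dimension gate.

    Each gate lies in (0, 1) and is computed from both inputs, so the text
    representation decides how much of the cue it lets through.
    """
    if not (len(text_vec) == len(cue_vec) == len(w_text) == len(w_cue)):
        raise ValueError("all vectors must have the same length")
    fused = []
    for t, c, wt, wc in zip(text_vec, cue_vec, w_text, w_cue):
        gate = sigmoid(wt * t + wc * c + bias)
        fused.append(t + gate * c)  # residual: the text is always kept
    return fused


# With zero weights and zero bias every gate is 0.5, so half the cue is added.
fused = gated_fusion([1.0, -1.0], [2.0, 4.0], [0.0, 0.0], [0.0, 0.0])
```

In a real model the weights would be learned matrices and the vectors would be hidden states, with one gate for the audio cues and one for the visual cues.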

Qualitative Analysis

In this experiment, the following five main models were used.

  1. BART (Lewis et al., 2020): a model with a standard machine translation architecture, combining a bidirectional Transformer encoder like BERT's with an autoregressive Transformer decoder like GPT's. The paper uses its base version
  2. mBART (Liu et al., 2020): a model that follows the same architecture and objectives as BART, but is trained on large monolingual corpora in many languages
  3. MAF-TAB: a BART-based model with the MAF module and audio cues
  4. MAF-TVB: a BART-based model with the MAF module and visual cues
  5. MAF-TAVB: a BART-based model with the MAF module and audio-visual cues

The table below shows sample explanations generated by the best-performing model, MAF-TAVB, alongside the corresponding explanations generated by BART.

From these results, we can see the following.

  • (a) is an example where there is room for improvement: the explanations generated by BART and MAF-TAVB are inconsistent, and neither is appropriate for the context of the dialogue
  • (b) is an example where MAF-TAVB generates explanatory text that is consistent with the topic of the dialogue, unlike the text generated by BART
  • (c) is an example where MAF-TAVB generates explanatory text that captures the sarcastic expression better than BART

Thus, apart from cases like (a), these examples suggest that MAF, which incorporates auditory and visual information, understands sarcastic expressions more appropriately than BART and can generate better explanatory sentences.

Human Evaluation

Since the proposed SED task is generative, it also requires human evaluation of the generated results.

Therefore, the user study in this paper was conducted under the following conditions.

  • We selected 30 instances from the test set and conducted a user study with the help of 25 evaluators
  • Evaluators were given transcripts of dialogue containing sarcastic expressions and video clips with audio and asked to rate the generated descriptions
  • After watching the video clip, each evaluator rated the generated explanation on a scale of 0 to 5 (5 being the best) on the following criteria
    • Coherency: how well the explanation is organized and structured
    • Related to dialogue: whether the generated explanation is related to the topic of the dialogue
    • Related to sarcasm: whether the generated explanation describes something related to the sarcasm in the dialogue
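The per-criterion averages reported for such a study are plain means over the collected rating sheets. A minimal sketch of that aggregation follows. The `average_scores` function, the criterion keys, and the ratings are made up for illustration and are not data from the paper.

```python
from statistics import mean
from typing import Dict, List

CRITERIA = ("coherency", "related_to_dialogue", "related_to_sarcasm")


def average_scores(ratings: List[Dict[str, int]]) -> Dict[str, float]:
    """Average 0-5 ratings per criterion over all collected rating sheets."""
    if not ratings:
        raise ValueError("no ratings given")
    for sheet in ratings:
        for name in CRITERIA:
            if not 0 <= sheet[name] <= 5:
                raise ValueError(f"{name} must be between 0 and 5")
    return {
        name: float(mean(sheet[name] for sheet in ratings))
        for name in CRITERIA
    }


# Made-up ratings from three evaluators for one generated explanation.
toy_ratings = [
    {"coherency": 4, "related_to_dialogue": 3, "related_to_sarcasm": 2},
    {"coherency": 3, "related_to_dialogue": 3, "related_to_sarcasm": 3},
    {"coherency": 5, "related_to_dialogue": 3, "related_to_sarcasm": 1},
]
averages = average_scores(toy_ratings)
```

In the study itself, the same kind of average would be taken over all 30 instances and 25 evaluators for each model.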

The table below shows the average scores for each of the categories above for the five models.

From these results, we can see that

  • MAF-TAVB was rated as producing better-organized and more coherent explanatory text than the other models
  • MAF-TAVB and MAF-TVB were rated as more focused on the topic of the dialogue, with an increase of 0.55 points in the "Related to dialogue" item
  • MAF-TAVB was also rated as better than BART at understanding sarcastic expressions, with an improvement of about 0.6 points in the "Related to sarcasm" item

These results suggest that the proposed model can incorporate information that is not explicitly included in the conversation, such as scene descriptions, facial features, and characters' expressions.

However, none of the mean scores in the table reaches 3.5, suggesting that further study is needed for this task.


What did you think? In the task presented here, the quality of the generated sarcasm explanations is not yet high in human evaluation, and there is room for improvement.

However, as this research field continues to develop, it may not be long before an AI capable of understanding sarcastic expressions at a level comparable to that of humans is realized.

If you are interested, the details of the dataset and the model architecture introduced in this article can be found in the paper.
