A Framework Is Now Available To Generate Images That Reflect The Figurative Expressions Contained In The Prompts!
3 main points
✔️ Proposed a Human-AI collaboration framework to generate images containing visual metaphors from linguistic metaphors
✔️ Created HAIVMet (Human-AI Visual Metaphor), a dataset consisting of 6476 images containing visual metaphors
✔️ Experimental results show that HAIVMet represents visual metaphors better than the outputs of existing models
I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors
written by Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou, Yue Yang, Marianna Apidianaki, Smaranda Muresan
(Submitted on 24 May 2023 (v1), last revised 14 Jul 2023 (this version, v2))
Comments: ACL 2023
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Visual metaphors are a powerful expressive technique used to convey messages and creative ideas through images, and like linguistic metaphors, they have been used frequently in advertising and creative writing.
In addition, in recent years, the use of generative AI in advertising and creative work has become more common, and these expressive techniques allow for more compelling image generation.
On the other hand, recent diffusion-model-based generative AIs such as MidJourney and Stable Diffusion can generate higher-quality images than VAEs and GANs, but they are unable to capture the abstraction of the linguistic metaphors in the prompts. This problem makes it difficult to generate images with visual metaphors.
To solve this problem, the authors built a Human-AI collaboration framework that combines a large language model and a diffusion model, used it to create HAIVMet (Human-AI Visual Metaphor), a dataset consisting of 6476 images containing visual metaphors, and generated images containing visual metaphors. This article introduces that paper.
Problems with Generative AI
Large-scale generative AIs based on diffusion models such as MidJourney and Stable Diffusion have attracted attention because of their ability to generate high-quality images conditional on input prompts.
However, the task proposed in this paper, generating images containing visual metaphors from linguistic metaphors, first requires the model to identify the implicit meaning of the prompt and its relationship to the associated objects, and then to find a way to combine them in the generated image.
As an example of the difficulty these tasks pose to existing generative AI, see the figures below (left: images generated by regular DALL-E2, right: images generated by DALL-E2 using this framework).
This compares the models' outputs for the prompt "My bedroom is a pig sty", which contains a linguistic metaphor meaning "My bedroom is a mess".
In response to this input, the plain DALL-E2 only produces an image of a pink room (probably due to the pig's skin color) with a toy pig, indicating that it does not capture the metaphor of pig sty = clutter.
On the other hand, DALL-E2 used with this framework was able to generate images that represent the metaphor, and this example shows both the limitations of existing generative AI and the effectiveness of the framework.
Human-AI Collaboration Framework & Human-AI Visual Metaphor Dataset
In this paper, HAIVMet (Human-AI Visual Metaphor), a dataset consisting of 6476 images with visual metaphors, was created using the Human-AI collaboration framework shown in the figure below.
The procedure for creating this dataset is as follows:
- Select linguistic metaphors that are easy to represent when generated as images.
- Using a large language model, sometimes with the help of experts, generate prompts (called visual elaborations) for outputting images that capture the relationship between the visual metaphors and their associated objects.
- Generate high-quality images containing visual metaphors from the visual elaborations using a diffusion-based model, with experts filtering out low-quality images.
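Putting the three steps together, the overall collection loop might look like the following minimal Python sketch. Every callable here (is_visual, elaborate, refine, generate, approve) is an illustrative placeholder for one of the stages above, not the authors' actual code.

```python
# Minimal sketch of the HAIVMet collection loop described above.
# Each callable argument is a placeholder for one pipeline stage;
# none of these names come from the paper's codebase.

def build_haivmet(metaphors, is_visual, elaborate, refine, generate, approve,
                  n_images=4):
    dataset = []
    for metaphor in metaphors:
        # Step 1: keep only metaphors that can be grounded visually
        if not is_visual(metaphor):
            continue
        # Step 2: LLM-generated visual elaboration, refined by an expert
        prompt = refine(elaborate(metaphor))
        # Step 3: generate candidate images and keep expert-approved ones
        for image in generate(prompt, n_images):
            if approve(image, metaphor):
                dataset.append((metaphor, prompt, image))
    return dataset
```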
Let's look at them one by one.
Visually Grounded Linguistic Metaphors
Considering that not all linguistic metaphors can be rendered as images, the authors first manually selected linguistic metaphors that are easy to represent when generated as images.
For example, " love" can be expressed by two people holding hands with a heart above it, "confusion" can be expressed as a question mark, and "idea" can be expressed by a light bulb over the head.
On the other hand, items that represent non-visual phenomena, such as smells and sounds, are excluded because they are difficult to represent in images.
Visual Elaboration Generation with Chain-of-Thought Prompting
The generative model did not work well with prompts containing linguistic metaphors because it could not model implicit metaphorical expressions.
Therefore, the authors focused on Chain-of-Thought (CoT) Prompting, a prompting method to improve the inference capability of language models.
In this method, the model decomposes the problem into multiple steps; this framework uses CoT prompting with InstructGPT-3 to generate prompts that draw out the implicit metaphors and associated objects of linguistic metaphors.
In this paper, the prompts generated through this sequence of CoT prompting are called visual elaborations, and using these prompts helps the model output images that contain better visual metaphors.
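As a rough illustration of what such CoT prompting could look like, the sketch below decomposes a metaphor into its implicit meaning, associated objects, and a final visual elaboration. The few-shot example and the `complete` callable (standing in for any text-completion endpoint; the paper used InstructGPT) are assumptions for illustration, not the authors' exact prompt.

```python
# Rough sketch of CoT prompting for visual elaboration. `complete` stands
# in for any text-completion endpoint; the few-shot example below is
# illustrative, not the authors' exact prompt.

COT_TEMPLATE = """\
Metaphor: My bedroom is a pig sty.
Step 1 - implicit meaning: the bedroom is extremely messy.
Step 2 - associated objects: scattered clothes, food wrappers, an unmade bed.
Step 3 - visual elaboration: An illustration of a bedroom overflowing with scattered clothes, food wrappers, and an unmade bed.

Metaphor: {metaphor}
Step 1 - implicit meaning:"""

def visual_elaboration(metaphor: str, complete) -> str:
    """Run the CoT prompt, then keep only the final Step 3 line as the
    image-generation prompt."""
    reasoning = complete(COT_TEMPLATE.format(metaphor=metaphor))
    for line in reasoning.splitlines():
        if line.strip().startswith("Step 3"):
            return line.split(":", 1)[1].strip()
    return reasoning.strip()  # fallback: use the whole completion
```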
However, while this approach produces high-quality prompts, not all of the generated visual elaborations are perfect, so three experts in figurative language were asked to act as annotators and edit the imperfect visual elaborations.
An example of editing a prompt is shown in the figure below.
The two images shown in the figure were generated by DALL-E2 from a visual elaboration and from an expert-edited prompt, both based on a sentence containing the linguistic metaphor "The news of the accident was a dagger in her heart".
Figure (a) is the output from the prompt "An illustration of a heart with a dagger stuck into it, dripping with blood and pain in the woman's eyes".
Figure (b), on the other hand, is the output from the expert-edited version of that prompt, "An illustration of a woman receiving a phone call and her heart with a dagger stuck into it, dripping with blood and pain in the woman's eyes".
Visual Metaphor Generation and Human Quality Check
Finally, after having DALL-E2 generate multiple images using the prompts from the aforementioned steps as input, experts checked each generated image to see if it accurately represented the original linguistic metaphor.
The dataset thus collected contains 1540 unique linguistic metaphors (and their associated visual elaborations) and 6476 images, which the authors have named HAIVMet (Human-AI Visual Metaphor).
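For reference, one HAIVMet entry can be thought of as a (metaphor, elaboration, images) triple. The record sketch below uses assumed field names based on the description above, not the released dataset's actual schema.

```python
# Illustrative layout of a single HAIVMet entry; the field names are
# assumptions based on the description above, not the released schema.
from dataclasses import dataclass, field

@dataclass
class HAIVMetRecord:
    metaphor: str       # linguistic metaphor, e.g. "My bedroom is a pig sty"
    elaboration: str    # LLM-generated, expert-edited visual elaboration
    image_paths: list[str] = field(default_factory=list)  # expert-approved images
```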
Evaluation
To evaluate the created HAIVMet, this paper conducted a validation study comparing the images contained in HAIVMet with those output by existing models, some of which were given prompts generated with the aforementioned Human-AI collaboration framework.
The models used for the comparison are as follows:
- LLM-DALL-E2: DALL-E2 with prompts generated using the Human-AI collaboration framework as input
- LLM-SD: Stable Diffusion with prompts generated using the Human-AI collaboration framework as input
- LLM-SD-Structured: LLM-SD combined with the structured diffusion method used in previous studies
- DALL-E2: Normal DALL-E2
- SD: Normal Stable Diffusion
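To get a feel for the difference between the SD and LLM-SD conditions, the sketch below feeds Stable Diffusion (via the open-source diffusers library) first the raw metaphor and then an elaborated prompt. The model id and both prompts are illustrative choices, not the paper's exact experimental setup.

```python
# Sketch of the SD vs. LLM-SD comparison using the diffusers library.
# Model id and prompts are illustrative, not the paper's exact setup.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

metaphor = "My bedroom is a pig sty"  # SD condition: raw linguistic metaphor
elaboration = ("An illustration of a bedroom overflowing with scattered "
               "clothes, food wrappers, and an unmade bed")  # LLM-SD condition

pipe(metaphor).images[0].save("sd_baseline.png")
pipe(elaboration).images[0].save("llm_sd.png")
```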
The results are shown in the figure below.
The metaphor statement in the left column of the figure is the linguistic metaphor paired with the HAIVMet image, while the columns for the other models show the images generated when the metaphor statement was used as input.
The figure shows that HAIVMet's images express the metaphors contained in the sentences well.
In addition, it is noteworthy that the images generated by LLM-DALL-E2, LLM-SD, and LLM-SD-Structured, which employ the Human-AI collaboration framework proposed in this paper, captured the metaphorical expressions successfully, if not perfectly; this result demonstrates the effectiveness of the framework.
Summary
How was it? In this article, we described a paper that built a Human-AI collaboration framework combining a large language model and a diffusion model to create HAIVMet (Human-AI Visual Metaphor), a dataset of 6476 images containing visual metaphors, and to generate images containing visual metaphors.
The vast amount of information in the dataset collected in this paper will be a very important resource for understanding the limitations of current image generation AI and for building more expressive models, including metaphors, in the future.
In addition, the authors mention that they will further examine the relationship between the quality of the visual metaphors in the generated images and the prompt phrases, and how this effect varies across models; we are very much looking forward to further progress.
Readers interested in the details of the dataset and experimental results can find them in the paper.