DALLE-2 Gets Its Own Language!
3 main points
✔️ Black-box investigation of the unique language handled by DALL-E2
✔️ Questionable consistency as a unique language
✔️ Challenges with model interpretability and security
Discovering the Hidden Vocabulary of DALLE-2
written by Giannis Daras, Alexandros G. Dimakis
(Submitted on 1 Jun 2022)
Comments: Published on arxiv.
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
From 2021 to 2022, a series of Text-to-Image technologies, which generate images from text, were announced. They have attracted a great deal of attention and are becoming increasingly familiar. Until recently, such technologies were not generally available and the barrier to trying them was high. However, with the June 2022 release of the DALL-E mini image generation system, anyone can now try image generation, and images generated by DALL-E mini are being shared widely on Twitter. Many readers of this article may have used DALL-E mini; although its output is not as polished as that of DALL-E2 or Imagen, many were probably surprised by the experience of seeing an image generated from text.
|Release date|Model name|Development team|
|May 2022|Imagen|Google|
|June 2022|Parti|Google|
In this article, we introduce a paper on DALL-E2, an image generation model that has been attracting a lot of attention. The paper finds that when sentences that at first glance make no sense (absurd prompts) are input to DALL-E2, there is a certain relationship between those sentences and the generated images. In other words, DALL-E2 appears to have its own vocabulary, even though the sentences are meaningless to humans.
For example, when the sentence "Apoploe vesrreaitais eating Contarra ccetnxniams luryca tanniounons" is input to DALL-E2, the image shown below is generated. This result suggests that DALL-E2 has its own lexicon, in which "Apoploe vesrreaitais" means "bird" and "Contarra ccetnxniams luryca tanniounons" means "insect". In other words, this prompt could mean "birds eat insects" in DALL-E2's vocabulary.
How to find DALL-E2's language
The method used to find DALL-E2's language is a black-box method: it infers the meaning of words, and the relationship between words and word sequences, solely from the input sentences and the output images.
For example, to find out how DALL-E2 renders the word "vegetables", input one of the following sentences to DALL-E2.
- A book that has the word vegetables written on it.
- Two people talking about vegetables, with subtitles.
- The word vegetables is written in 10 languages.
For these inputs, DALL-E2 often generates images that contain written text. However, as noted in the DALL-E2 paper and some other reports, that text is not meaningful to humans. For example, when the sentence "Two farmers talking about vegetables, with subtitles." is input into DALL-E2, an image like the one shown in Figure (a) below is generated. As Figure (a) shows, the words in it are completely incomprehensible to humans.
However, the paper finds that these words do carry meaning and can be regarded as DALL-E2's own vocabulary. The authors took the words "Vicootess" and "Apoploe vesrreaitais", which appear in the image in Figure (a), and input them to DALL-E2. As Figures (b) and (c) show, "Vicootess" appears to mean vegetable, and "Apoploe vesrreaitais" appears to mean bird. In other words, Figure (a) appears to show two farmers talking about a bird that damages their vegetables.
Thus, when words taken from an image generated by DALL-E2 are fed back into DALL-E2 and produce related images, we can assume that the words DALL-E2 uses have some consistency (meaning).
However, the paper also points out that this method is not always effective: the words read off an image can also produce seemingly random images with no consistency. Note that the method needs nothing more than access to DALL-E2 via the API.
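The black-box procedure described in this section can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the templates come from the example prompts above, while `generate_image`, `read_text`, and `describe` are hypothetical callables standing in for a call to the image model, a human (or OCR) reading the text in the image, and a human judging what the image shows.

```python
from typing import Callable, Dict, List

# Prompt templates taken from the example sentences in this article.
TEMPLATES = [
    "A book that has the word {word} written on it.",
    "Two people talking about {word}, with subtitles.",
    "The word {word} is written in 10 languages.",
]


def build_probe_prompts(word: str) -> List[str]:
    """Fill each template with the word whose rendering we want to probe."""
    return [template.format(word=word) for template in TEMPLATES]


def round_trip(
    word: str,
    generate_image: Callable[[str], object],   # hypothetical: prompt -> image
    read_text: Callable[[object], List[str]],  # hypothetical: image -> strings seen in it
    describe: Callable[[object], str],         # hypothetical: image -> short description
) -> Dict[str, str]:
    """Feed every string found in a probe image back to the model.

    Returns a mapping from each candidate string to a description of the
    image that the string produced when used as a prompt on its own.
    """
    results: Dict[str, str] = {}
    for prompt in build_probe_prompts(word):
        image = generate_image(prompt)
        for candidate in read_text(image):
            results[candidate] = describe(generate_image(candidate))
    return results
```

Because the three callables are passed in, the loop can be tried out with simple stand-ins before it is connected to a real model. As the article notes, a single round trip is weak evidence, since the same string may yield unrelated images on another run.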
Unique language features of DALL-E2
The authors conducted several experiments to investigate the characteristics of the unique vocabulary found in DALL-E2. First, they asked whether words from this vocabulary can be combined in a single sentence, as in human languages. Using the two words "Apoploe vesrreaitais" for "bird" and "Contarra ccetnxniams luryca tanniounons" for "insect" or "pest", they composed the sentence "Apoploe vesrreaitais eating Contarra ccetnxniams luryca tanniounons" and input it into DALL-E2. As a result, they confirmed that it can generate an image of "a bird eating an insect", as shown in the figure below. Such an image is not generated every time, but it can be generated.
They then added words for image styles (Painting, Cartoon, 3-D rendering, line art) to "Apoploe vesrreaitais" for "bird" to see whether "Apoploe vesrreaitais" corresponds to a visual concept. The results are shown in the figure below, and it seems that the word sometimes changes to "flying insect" instead of "bird", as in (c) and (d).
The authors also investigated the consistency between the text that appears in a generated image and the images generated from that text. For example, as mentioned above, inputting the text "Two farmers talking about vegetables, with subtitles." produced an image that can be read as two farmers talking about birds damaging their vegetables. Besides a word for "vegetables", which was part of the input, the generated image also contained "Apoploe vesrreaitais" (bird), a word that plausibly fits the situation. In other words, a string produced by DALL-E2 (Apoploe vesrreaitais) that at first glance seemed meaningless turned out to have a meaning (bird) when it was visualized by DALL-E2.
As another example, when the sentence "Two whales talking about food, with subtitles." is input into DALL-E2, two whales and text such as "Wa ch zod ahaakes rea" are generated, as shown in the figure below (left). When this text is input to DALL-E2, seafood is displayed, as shown in the figure below (right), which is consistent with the text being a line spoken in the original image. Thus, the generated text does not seem to be unrelated to the scene; rather, it seems to be consistent with the situation.
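The two experiments above, composing hidden-vocabulary words into one sentence and attaching style words, amount to simple prompt construction. The sketch below assumes the tentative meanings reported in this article; the "<style> of <word>" phrasing is an assumption made for illustration, and the paper's exact wording may differ.

```python
from typing import Dict, List

# Tentative glossary from this article; the meanings are reported as unstable.
HIDDEN_VOCAB: Dict[str, str] = {
    "bird": "Apoploe vesrreaitais",
    "insect": "Contarra ccetnxniams luryca tanniounons",
}

# Style words listed in the article.
STYLES = ["Painting", "Cartoon", "3-D rendering", "line art"]


def compose(subject: str, verb: str, obj: str) -> str:
    """Build a prompt such as '<bird word> eating <insect word>'."""
    return f"{HIDDEN_VOCAB[subject]} {verb} {HIDDEN_VOCAB[obj]}"


def styled_prompts(concept: str) -> List[str]:
    """Build one prompt per style (the phrasing is an assumed template)."""
    return [f"{style} of {HIDDEN_VOCAB[concept]}" for style in STYLES]


if __name__ == "__main__":
    print(compose("bird", "eating", "insect"))
    for prompt in styled_prompts("bird"):
        print(prompt)
```

If the hidden word really denotes a visual concept, the subject should stay the same across all four styles while only the rendering changes.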
Challenges for DALL-E2
Some topics that may require further research are also mentioned. First, the words treated as DALL-E2's language in this paper (e.g. Apoploe vesrreaitais) seem to have been chosen because they are relatively consistent, yet even they often change in meaning each time they are entered into DALL-E2. In other words, Apoploe vesrreaitais does not always mean "bird"; it sometimes means a different animal or something else.
This point has been a hot topic on Twitter, where researchers seem to be divided on whether this can be judged a unique language. Some tweets (in English) report completely different behavior for the words introduced here.
The paper states that such behavior is a major concern from the perspective of model interpretability and security, and that more fundamental research is needed to understand these phenomena in order to create robust image generation models that behave as humans expect.
This was a highly topical paper about an image generation model that has surprised the world and may have acquired a new "unique language". It has already been introduced in many web media, so it can fairly be called an innovative and very high-profile technology. However, there are many phenomena in these high-performance image generation models, not only DALL-E2 but also Imagen, that we do not yet understand, and there are concerns about unexpected abuse. For this reason, the models are not yet open to the public. As research, this is an interesting technology that shows the potential of machine learning, but it may take a little more time to put it to practical use. Nevertheless, the progress in the past year has been dizzying, and I'm looking forward to seeing more of it in the future.
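One simple way to put a number on the consistency concern raised in this section is to generate many images from the same hidden-vocabulary prompt, have a person label what each image shows, and report how often the most common label occurs. The sketch below is an assumption about how such a check might look, not a metric from the paper, and the labels in the usage example are invented.

```python
from collections import Counter
from typing import List, Tuple


def consistency(labels: List[str]) -> Tuple[str, float]:
    """Return the most common label and the fraction of images that carry it."""
    if not labels:
        raise ValueError("need at least one label")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


if __name__ == "__main__":
    # Hypothetical human labels for four images from one prompt.
    example = ["bird", "bird", "insect", "bird"]
    print(consistency(example))
```

A fraction close to 1 would support reading the string as a stable "word", while a low fraction would match the reports of completely different behavior.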