What Is "Imagic" For High-definition Image Editing With One Text And One Image!
3 main points
✔️ Only one text and one input image are needed, enabling high-definition image editing that follows the text
✔️ Linearly interpolates the embeddings of two texts, combining their information to enable high-definition editing with a Diffusion Model
✔️ Applicable to various types of image editing (pose changes, editing multiple objects, etc.), with high quality and versatility
Imagic: Text-Based Real Image Editing with Diffusion Models
written by Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani
(Submitted on 17 Oct 2022)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
outline
In the past few years, models that generate images from text have attracted a lot of attention. Many models have been announced, including DALL-E 2, Parti, Imagen, Stable Diffusion, and Midjourney, and some of these services can be used by ordinary users. Recently, there is even an official LINE account called "Drawing Bari Gutto-kun" that generates an image matching the text you send it on LINE.
And now a new image editing technology called "Imagic" has been announced, which builds on these image generation models. Until now, technologies for generating images from text have been announced in rapid succession; this time, however, the technology edits a portion of an image in high definition to match the text, and it does so with only one text and one image.
There have been technologies that edit images from text and an image in the past, such as "SDEdit" and "Text2LIVE", but they were limited in what they could edit, such as applying colors, adding objects, or converting the style of the image. When inputting an image to be edited, supplementary information was also required, such as a mask marking the part to be edited or multiple images of the same object. The new method, by contrast, requires only text and a single image, with no additional information.
The following figure shows images edited by "Imagic". For example, in the bird image on the upper left, if you supply the Input Image and the Target Text "A bird spreading wings", an Edited Image matching the meaning of the text is generated. The edited image retains the information in the Input Image very well, down to the details of the background, the perch, and the bird's pattern. In the parrot image in the center of the bottom row, two parrots of the same species have each been edited to reflect the Target Text "Two kissing parrots". In this way, even when there are multiple subjects in one image, the image can be edited according to the meaning of the text without confusion.
How does Imagic work?
Imagic consists of three processes, (A), (B), and (C), as shown in the figure below. Given the "Target Text", which describes how the image is to be edited, and the "Input" image to be edited, step (A) obtains the embedding e_tgt of the Target Text. Then, using a pre-trained Diffusion Model, e_tgt is optimized in its neighborhood so that the Input is reconstructed, yielding e_opt. If e_tgt and e_opt end up too far apart at this point, the divergence between Input and Output becomes too large, resulting in unnatural editing results.
At this stage, e_opt does not yet reproduce the Input with sufficient fidelity, so in (B), e_opt is fixed and the Diffusion Model is fine-tuned so that the Input can be generated from e_opt more accurately. With these two processes, e_opt has expressive power close to that of e_tgt while also retaining detailed information about the Input, such as the background and object placement, and can reproduce it with high accuracy. Finally, in (C), e_tgt and e_opt are combined by linear interpolation, and a fine, subtle Output is obtained using the fine-tuned Diffusion Model.
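To make the three steps concrete, here is a minimal sketch of the procedure in PyTorch-style Python. This is an illustration based on the description above, not the authors' code: the model interface (num_timesteps, q_sample, the noise-prediction call, and generate) is an assumed stand-in for a real diffusion implementation, and the step counts and learning rates are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def denoise_loss(model, image, embedding):
    # Standard diffusion objective: noise the image at a random timestep
    # and train to predict that noise, conditioned on the text embedding.
    # num_timesteps and q_sample are assumed parts of the model API.
    t = torch.randint(0, model.num_timesteps, (image.shape[0],))
    noise = torch.randn_like(image)
    noisy = model.q_sample(image, t, noise)
    return F.mse_loss(model(noisy, t, embedding), noise)

def optimize_embedding(model, e_tgt, input_image, steps=100, lr=1e-3):
    # (A) Freeze the model; optimize the embedding near e_tgt so that
    # the Input image is reconstructed, yielding e_opt.
    e_opt = e_tgt.clone().requires_grad_(True)
    opt = torch.optim.Adam([e_opt], lr=lr)
    for _ in range(steps):
        loss = denoise_loss(model, input_image, e_opt)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return e_opt.detach()

def finetune_model(model, e_opt, input_image, steps=1500, lr=5e-7):
    # (B) Fix e_opt; fine-tune the model weights so that the Input image
    # is reproduced from e_opt with high fidelity.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = denoise_loss(model, input_image, e_opt)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

def edit(model, e_tgt, e_opt, eta=0.7):
    # (C) Linearly interpolate the embeddings and sample an edited image.
    e_bar = eta * e_tgt + (1.0 - eta) * e_opt
    return model.generate(e_bar)  # assumed sampling routine
```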
The linear interpolation of e_tgt and e_opt is expressed as ē = η · e_tgt + (1 − η) · e_opt, where η is a hyperparameter that takes values between 0 and 1.
By adjusting η, we can control the Output as shown in the figure below: the closer η is to 0, the closer the result is to e_opt (the Input), and the closer η is to 1, the more strongly the Target Text is reflected.
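As a small illustration of this η sweep, the snippet below (reusing the hypothetical names from the sketch above) interpolates the two embeddings at several η values and samples an image for each; generate remains an assumed stand-in for the fine-tuned model's sampler.

```python
def interpolate(e_tgt, e_opt, eta):
    # e_bar = eta * e_tgt + (1 - eta) * e_opt, with eta in [0, 1].
    return eta * e_tgt + (1.0 - eta) * e_opt

# eta near 0 stays close to the Input (e_opt); eta near 1 follows the
# Target Text (e_tgt) more strongly.
for eta in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    image = finetuned_model.generate(interpolate(e_tgt, e_opt, eta))
```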
It is also shown that fine-tuning the Diffusion Model in (B) improves how well the background and composition of the Input are reproduced. In the figure below, the upper row shows results without fine-tuning and the lower row shows results with fine-tuning. Comparing the upper and lower images at η=0.000, you can see that detailed information (such as the background) differs greatly; the lower row, with fine-tuning, retains the information of the Input much better.
How well does Imagic perform?
First, as a qualitative evaluation, various types of editing are tried, as shown in the figure below. From the top row, the results show editing of posture, editing of components, editing of multiple objects, addition of components, editing of painting style, and editing of color, respectively. All of these edits are of very high quality, with no sense of inconsistency.
Next, the figure below shows the results of entering different texts for the same image. A high-quality image is produced for every text, showing that the method is versatile enough to handle all kinds of edits.
Imagic also uses a Diffusion Model, which is probabilistic and may produce different results for the same text and image. Below are images generated with different random seeds (η is tuned separately for each seed).
The relationship between different seeds and η values is also investigated, as shown in the figure below, where the results of image editing with different seeds are shown in the upper, middle, and lower rows. As you can see, different seeds require different η values for the edit to take effect: the upper row switches to the edited result at around η=0.800, while the middle and lower rows switch at around η=0.700. In the lower row, at η=0.700-0.800, the edit even seems to move in the opposite direction from the input image.
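Because sampling is stochastic, fixing the random seed makes a given (seed, η) pair reproducible. The sketch below, again using the hypothetical names from the earlier snippets, shows how one might search for a suitable η per seed and save the candidates for inspection.

```python
import torch
from torchvision.utils import save_image

# Each seed reacts to the edit at a different eta, so sweep eta per seed
# and inspect the saved images to pick the best (seed, eta) pair.
for seed in (0, 1, 2):
    torch.manual_seed(seed)  # fix the sampler's randomness
    for eta in (0.6, 0.7, 0.8, 0.9):
        e_bar = eta * e_tgt + (1.0 - eta) * e_opt
        image = finetuned_model.generate(e_bar)  # assumed sampling call
        save_image(image, f"seed{seed}_eta{eta:.2f}.png")
```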
The authors also note that natural language is inherently ambiguous and imprecise, and that this probabilistic nature actually makes the method easier to use, since it generates several alternatives to choose from.
Is Imagic still incomplete? What are its limitations?
As shown above, Imagic performs well in various qualitative evaluations. At the same time, however, it also exhibits some failures, as shown in the figure below. For example, as shown in the top row, the result may not fit the image as a whole. In "A photo of a traffic jam", some areas of the image reflect the traffic jam, but the other lanes are so empty that the edit as a whole does not convey a traffic jam. Likewise, in "A dog lying down", the editing of the dog works to some extent, but the box behind the dog disappears, so the edit as a whole does not succeed.
Also, even when the edit itself is applied appropriately, the zoom and camera angle may be affected. For example, in "A photo of a race car" on the left of the bottom row, a number is added to the car like a racing car and the image is edited to look like a car race from the 1900s, but at the same time the car has moved to a more distant position. In "Pizza with pepperoni" on the right of the bottom row, the pepperoni has been added seamlessly, but the pizza has been enlarged and the image cropped. The method is good at editing delicate details like these, but the framing of the whole image sometimes changes unintentionally.
The editing results are also compared with other major techniques (SDEdit, Text2LIVE) that can edit an image from a single text and image, as shown in the figure below. These results make clear that Imagic performs more detailed and subtle edits with higher accuracy than the other techniques, while still retaining the detailed information of the original image.
summary
In this paper, we propose a new image editing method called Imagic. It achieves very subtle and delicate editing with only a single image to be edited and text that indicates what you want to edit.
A pre-trained Diffusion Model is used to find a text embedding that represents the input image well; the Diffusion Model is then fine-tuned to fit the image even better; and finally, the embedding that fits the input image and the embedding of the text that conveys the editing goal are linearly interpolated, after which the edited image is generated by the fine-tuned Diffusion Model.
Compared with the other editing methods discussed in this paper, it allows a wider range of flexible edits, such as changing poses, reshaping objects, and recomposing the image on demand, in addition to simple edits such as style and color. And all of this is possible with only text and a single image, without image masks or other auxiliary input.
In the future, the authors plan to develop a method that makes editing more efficient by automatically selecting η according to the required edit. We can expect ever more efficient editing and processing of video and still images in production settings, alongside tools such as Photoshop.
On the other hand, it is also expected that editing and manipulation of videos and still images posted on social media will become easier and more sophisticated. The problem of deepfakes has not yet been solved, and the battle between deepfake technology and the technology to detect it continues, while the damage caused by fake information keeps increasing. So while we enjoy the convenience of this technology, its use will remain a matter of debate.
There's some code out there, so give it a try!