
Disentangled Diffusion: T2I Model To Extract Multiple Concepts From A Single Image


Image Generation

3 main points
✔️ Proposes Disentangled Diffusion (DisenDiff) to extract multiple concepts from a single image
✔️ Introduces losses that separate the classes without overlap and faithfully extract the appearance of each target concept
✔️ DisenDiff demonstrated that it outperforms SOTA in both qualitative and quantitative evaluations

Attention Calibration for Disentangled Text-to-Image Personalization
written by Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang
(Submitted on 27 Mar 2024 (v1), last revised 11 Apr 2024 (this version, v2))
Comments: CVPR 2024 (Oral)
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In recent years, dramatic advances in Text-to-Image (T2I) models trained on large corpora have greatly improved the quality of image generation and synthesis. Given only a few input images, these models can now generate the concepts they contain in new scenes that do not exist in the reference images. At the same time, a challenge remains: when the training data is a single image, the attention maps become ambiguous, making it difficult for the diffusion model to learn and generate the concepts and appearances specific to that image.

Therefore, this paper proposes an attention calibration mechanism that improves the T2I model's understanding of concepts. To extract multiple concepts from a single image without them interfering with each other, the mechanism introduces learnable modifiers bound to classes, suppressing mutual influence between different concepts and strengthening the understanding of each class.

The proposed method, named Disentangled Diffusion (DisenDiff), is demonstrated to outperform SOTA methods in both qualitative and quantitative evaluations on a variety of datasets. It is also reported to be highly flexible in extended tasks, including interoperability with LoRA and inpainting techniques.

DisenDiff

The attention calibration mechanism of DisenDiff proposed in this paper disentangles multiple concepts from a single image in three stages, as shown in Figure 1.

First, the attention maps output by Stable Diffusion are sharpened for each class by suppression. Next, a binding loss L_bind is introduced so that each modifier corresponds to its class. Finally, a loss L_S&S is added so that each class is isolated independently. Each step is explained in the following sections.

This paper uses Stable Diffusion as the backbone model and CLIP as the pre-trained text encoder.

Figure 1: Overall view of DisenDiff

Introduction of learnable modifiers

Training the T2I model requires an appropriate text prompt along with the input image. In this paper, a modifier token "V_i*" is inserted before each class token, as in "V_1* cat and V_2* dog".
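As a concrete illustration, the following is a minimal sketch of how such modifier tokens could be registered with the tokenizer and text encoder of a Stable Diffusion pipeline using the Hugging Face diffusers/transformers stack. The model ID, token strings, and prompt are illustrative assumptions, not details taken from the paper's code.

```python
# Minimal sketch (assumptions: Hugging Face diffusers stack, illustrative model ID
# and token names). The new modifier-token embeddings would be the learnable part.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder

# One placeholder (modifier) token per concept, e.g. "<v1*> cat and <v2*> dog".
modifier_tokens = ["<v1*>", "<v2*>"]
tokenizer.add_tokens(modifier_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

# The prompt pairs each modifier with its class token.
prompt = "a photo of <v1*> cat and <v2*> dog"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
```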

Modifiers and Class Binding

Conventional methods tend to overfit when the training data is a single image, which leaves the attention map of each token ambiguous. Using the modifier tokens introduced above, DisenDiff constructs accurate cross-attention maps, as illustrated in Figure 2.
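For reference, here is a generic sketch of how the cross-attention map of a single text token is read out inside the U-Net: the softmax of QK^T/sqrt(d) gives, for every spatial location, a weight per text token, and slicing one token's column yields its 2-D map. The shapes and the averaging over heads are assumptions about a typical implementation, not details from the paper.

```python
# Generic sketch: per-token cross-attention map from image queries and text keys.
# Shapes and head-averaging are assumptions about a typical implementation.
import math
import torch

def token_attention_map(q: torch.Tensor, k: torch.Tensor, token_idx: int,
                        h: int, w: int) -> torch.Tensor:
    """q: (heads, h*w, d) image queries; k: (heads, n_tokens, d) text keys."""
    scores = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1]), dim=-1)
    # Column `token_idx` holds how strongly each pixel attends to that token.
    return scores[:, :, token_idx].mean(dim=0).reshape(h, w)
```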

Figure 2: Attention maps for each token

The top maps in Figure 2 show that, compared with the modifier token's map, the class token's attention map roughly captures the semantic boundary of its class. The binding loss L_bind therefore increases the IoU between the modifier token's and the class token's attention maps, activating the modifier token and aligning it with its corresponding class token.

However, applying this loss as is can cause issues such as competing attention at the same pixel or modifiers that fail to capture the concept comprehensively. The paper therefore smooths the attention maps with a Gaussian filter, G(A_t), before computing the loss.
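The sketch below reconstructs the binding idea under stated assumptions: a soft IoU between the Gaussian-smoothed attention maps of a modifier token and its class token is maximized (equivalently, 1 minus the IoU is minimized). The exact formulation in the paper may differ; the kernel size and the soft-IoU definition here are illustrative.

```python
# Illustrative reconstruction of the binding-loss idea, not the paper's exact formula:
# maximize a soft IoU between Gaussian-smoothed modifier- and class-token maps.
import torch
import torchvision.transforms.functional as TF

def soft_iou(a: torch.Tensor, b: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft IoU between two non-negative attention maps of shape (H, W)."""
    inter = torch.minimum(a, b).sum()
    union = torch.maximum(a, b).sum()
    return inter / (union + eps)

def bind_loss(attn_modifier: torch.Tensor, attn_class: torch.Tensor) -> torch.Tensor:
    """1 - IoU of the blurred maps; smaller means better alignment."""
    g_mod = TF.gaussian_blur(attn_modifier[None, None], kernel_size=3)[0, 0]
    g_cls = TF.gaussian_blur(attn_class[None, None], kernel_size=3)[0, 0]
    return 1.0 - soft_iou(g_mod, g_cls)
```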

Class separation and enhancement

As mentioned in the previous section, there is a tendency to overfit when the training data is a single image, and there is also a concern that a class token may encroach on the regions of other classes. The top maps in Figure 2 show that the attention map of the "cat" token has, to some extent, encroached on the "dog" region.

Therefore, the separate-and-strengthen loss L_S&S balances avoiding overlap with other objects against covering the target object comprehensively, improving the accuracy of the attention maps.

However, applying this loss as is may unnaturally emphasize certain classes because the activation distributions of the different classes are imbalanced. The paper therefore applies an element-wise self-multiplication, f_m(A_t^{c_i}) = A_t^{c_i} ⊙ A_t^{c_i}, to each attention map before computing the loss, suppressing activations that are less important for a class.
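A sketch of this separate-and-strengthen idea is given below, under the assumption that overlap between the filtered maps of different classes is penalized pairwise; the weighting and any additional strengthening term in the paper may differ.

```python
# Illustrative sketch: the squared map f_m(A) = A ⊙ A down-weights weak activations,
# and overlap between different classes' filtered maps is penalized pairwise.
import torch

def f_m(attn: torch.Tensor) -> torch.Tensor:
    """Element-wise self-multiplication to suppress low-importance activations."""
    return attn * attn

def separate_strengthen_loss(class_attns: list) -> torch.Tensor:
    """Penalize overlap between every pair of filtered class attention maps."""
    loss = class_attns[0].new_zeros(())
    for i in range(len(class_attns)):
        for j in range(i + 1, len(class_attns)):
            loss = loss + (f_m(class_attns[i]) * f_m(class_attns[j])).mean()
    return loss
```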

Overall training loss

Combining the above, the overall training loss is formed from the losses introduced so far, where S denotes the number of classes in the input image.
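As a rough sketch (reusing the bind_loss and separate_strengthen_loss functions from the snippets above), the objective could be assembled as the standard diffusion denoising loss plus the two calibration terms over the S classes; the inclusion of the denoising term and the weights lambda_bind and lambda_ss are assumptions, not values from the paper.

```python
# Rough sketch of a combined objective; weights are placeholders, and the exact
# composition in the paper may differ. Reuses bind_loss / separate_strengthen_loss above.
import torch.nn.functional as F

def total_loss(noise_pred, noise_target, modifier_attns, class_attns,
               lambda_bind: float = 1.0, lambda_ss: float = 1.0):
    rec = F.mse_loss(noise_pred, noise_target)  # standard diffusion denoising loss
    bind = sum(bind_loss(m, c) for m, c in zip(modifier_attns, class_attns))  # over S classes
    ss = separate_strengthen_loss(class_attns)
    return rec + lambda_bind * bind + lambda_ss * ss
```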

Experiment

Data-set

In this paper, experiments were conducted on 10 datasets spanning a wide range of categories, including people, animals, furniture, and people with pets or toys. Each image contains two different concepts. Inference was tested with 30 prompts per image: 10 prompts involving both concepts, 10 focusing on the first concept, and 10 focusing on the second concept.

Evaluation metrics

The evaluation is based on image-alignment and text-alignment, considered together. Image-alignment measures the cosine similarity in CLIP space between a generated image and the corresponding real image, while text-alignment measures the similarity between the prompt text and the generated image. Together, these metrics balance image reconstruction capability against editability.
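For concreteness, the two metrics can be computed from CLIP embeddings roughly as in the sketch below (Hugging Face transformers; the model ID is illustrative, and the paper may use a different CLIP variant).

```python
# Sketch of image-alignment and text-alignment as CLIP-space cosine similarities.
# Assumptions: Hugging Face transformers, illustrative model ID and CLIP variant.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def image_alignment(generated_img, real_img) -> float:
    feats = model.get_image_features(
        **processor(images=[generated_img, real_img], return_tensors="pt"))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] * feats[1]).sum())  # cosine similarity of the two images

@torch.no_grad()
def text_alignment(generated_img, prompt: str) -> float:
    img = model.get_image_features(**processor(images=generated_img, return_tensors="pt"))
    txt = model.get_text_features(**processor(text=prompt, return_tensors="pt"))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())  # cosine similarity between prompt and image
```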

Experimental results

The proposed model, DisenDiff, is compared with three state-of-the-art T2I personalization methods, Textual Inversion (TI), DreamBooth (DB), and Custom Diffusion (CD); the quantitative comparison is shown in Figure 3(a).

As can be seen from the means in the upper-left panel of Figure 3(a), the proposed model outperforms all compared models in image-alignment. A low image-alignment indicates that the original concepts are not well preserved during generation. The proposed model therefore achieves the highest image reconstruction capability while maintaining the effect of text editing.

The results of an ablation analysis confirming the contribution of each component are shown in Figure 3(b). The figure shows that the proposed method with all components enabled achieves the most balanced performance.

Figure 3: Quantitative comparison between the proposed method and three other T2I models

Figure 4 shows a visual comparison, over various prompts, between the proposed method and the two baseline models with high image-alignment.

The target prompts evaluate the learned concepts both independently and in combination across a variety of editing scenarios, such as changing the scene, adding objects, changing the style, and separating out individual concepts. For example, for an input image containing a cat and a dog, one prompt changes the breed of the dog in the generated image.

From Figure 4, we can see that DB's generated images lack elements essential to the target concept, while CD's generated images fail to preserve the concept's appearance. The proposed DisenDiff reproduces the original image with higher fidelity than the other models.

Figure 4: Qualitative comparison between the proposed method and two other T2I models

The paper concludes by showing that the proposed model is interoperable with LoRA and inpainting pipelines and can be used to build user-friendly applications.

Conclusion

In this paper, DisenDiff was proposed to learn multiple concepts from a single image without the concepts overlapping. To construct an accurate attention map for each token, dedicated losses were introduced on the cross-attention maps, accurately capturing the appearance of each target concept while reducing overfitting to the single training image. Comparisons with other T2I models show that the proposed model exhibits high fidelity to the input image and strong image reconstruction capability, demonstrating state-of-the-art performance both quantitatively and qualitatively.

What is particularly impressive about DisenDiff is that, by introducing dedicated losses on the cross-attention maps, it extracts the concepts specific to a single image better than conventional methods. The method is also compatible with other techniques such as inpainting and has the potential to improve the quality of image synthesis. I am very much looking forward to its progress.

