
Fine-Tuning a Text-to-Image Diffusion Model for Spurious Feature Generation


Image Recognition

3 main points
✔️ Spurious images help measure classifier reliability
✔️ Filtering many spurious images from the Internet to find more spurious features is time-consuming
✔️ Fine-tuning a Text-to-Image diffusion model is proposed as a way to generate spurious images

Fine-Tuning Text-To-Image Diffusion Models for Class-Wise Spurious Feature Generation
written by AprilPyone MaungMaung, Huy H. Nguyen, Hitoshi Kiya, Isao Echizen
(Submitted on 13 Feb 2024)
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Deep neural networks (DNNs) have achieved state-of-the-art results in visual recognition, natural language processing, and speech recognition. However, evaluating DNNs is not easy, especially in safety-critical areas such as breast cancer screening and automated driving.

Typically, the performance of an image classifier is evaluated on a fixed test set, which may differ from real-world conditions. For example, accuracy on the ImageNet test set does not necessarily reflect real-world performance.

One approach that has received a lot of attention recently for better evaluating classifiers is the use of spurious features and spurious images. Spurious features can simply be understood as features that often appear together with the main features of an image.

For example, cattle images often contain grasslands, and hummingbird images often contain red salvia flowers. Here, the cattle and hummingbirds are the primary features, while the grassland and red salvia flowers are spurious features. An image with spurious features is considered a spurious image.

If spurious features become strongly associated with a class, this can cause shortcut learning. For example, if a model relies on the red salvia flower feature to classify hummingbirds, it may classify a photo containing only red salvia flowers as a hummingbird, or conversely fail to recognize a photo containing only a hummingbird and no red salvia flowers. Therefore, evaluating DNNs with spurious features is critical for safety-critical applications.

A recent study introduced Spurious ImageNet by detecting spurious features in large datasets such as ImageNet. However, not all images in Spurious ImageNet behave as spurious images across different classifiers (Figure 1). Furthermore, filtering images with spurious features from the Internet is a time-consuming task.

Figure 1. Examples of spurious images: some images in the Spurious ImageNet dataset detected as "hummingbird" were instead classified as "threadfin" by another classifier.

The paper described here proposes to leverage Stable Diffusion's large-scale Text-to-Image model to generate images with spurious features across different classifiers. It aims to complement the Spurious ImageNet.

Technique

Summary

Figure 2: Overview of the entire fine-tuning process

Given a few spurious images of a particular class, the goal is to generate new spurious images for this particular class across different classifiers. Figure 2 shows the proposed fine-tuning framework for the Text-to-Image diffusion model.

The framework is based on DreamBooth [Ruiz et al., 2023], but differs from DreamBooth in two main ways: a new loss is added, and the text encoder is fine-tuned jointly with the noise predictor. The new loss is computed from the similarity between spurious and non-spurious images to facilitate the generation of spurious features. These details are discussed in the following subsections.

Stable Diffusion and Learning Loss

Diffusion models are a type of generative model comprising two processes: a diffusion (noising) process and a reverse diffusion (denoising) process.

In the diffusion process, noise is gradually added to the input image until it becomes pure Gaussian noise. This process is predefined and serves as supervision for the reverse diffusion process. The reverse diffusion process, in turn, gradually removes noise, starting from pure noise, until the original image is recovered.

At each step, the reverse transformation (prediction of the added noise) is learned. In other words, once the reverse diffusion process has been learned, images can be generated from pure noise.
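The two processes can be sketched in a few lines. The following is a minimal illustrative example (not the paper's code), using a toy linear noise schedule; at a late timestep almost no signal remains, so x_t is essentially pure Gaussian noise:

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factor for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Diffusion process: sample x_t from q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)  # the noise the model must learn to predict
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((8, 8))  # toy stand-in for an image
xt, eps = forward_diffuse(x0, t=999, alpha_bar=alpha_bar, rng=rng)
```

At t = 999 the retained signal ᾱ_t is nearly zero, which is why generation can start from pure noise and run the learned reverse process.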

When combined with a text condition, this yields a Text-to-Image generative model. Stable Diffusion [Rombach et al., 2022] is a widely known large-scale Text-to-Image diffusion model that operates in a latent space.

Given a text condition y (i.e., a text prompt), the loss function for learning is

Equation 1. Text-to-Image learning loss

where ϵ and ϵ_θ are the added and predicted noise, respectively, and τ_θ is the text encoder.
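In code, Equation 1 is essentially a mean-squared error between the two noises. A minimal sketch follows; the imperfect prediction here is a stand-in for the output of Stable Diffusion's conditioned U-Net ε_θ, not the real model:

```python
import numpy as np

def text_to_image_loss(eps, eps_pred):
    """Eq. 1 in essence: E[ || eps - eps_theta(z_t, t, tau_theta(y)) ||^2 ]."""
    return float(np.mean((eps - eps_pred) ** 2))

rng = np.random.default_rng(1)
eps = rng.standard_normal((4, 64))                   # noise added by the diffusion process
eps_pred = eps + 0.1 * rng.standard_normal((4, 64))  # imperfect prediction (U-Net stand-in)
loss = text_to_image_loss(eps, eps_pred)
```

A perfect noise predictor drives this loss to zero.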

Personalization of Stable Diffusion

Given a few images of a subject, the idea of personalization is to embed the subject into the output domain of Stable Diffusion so that new renditions of the subject can be synthesized in different contexts. After personalization, new images of the subject can be generated.

The personalization technique in this paper adjusts Stable Diffusion to integrate new information about the subject into its output domain without overfitting to the small number of reference images or losing prior knowledge.

The technique is similar to DreamBooth: the noise-predicting U-Net in Figure 2 is fine-tuned with reference images and a text prompt containing a unique identifier (e.g., "a photo of the [identifier] flower"). To preserve prior knowledge, a class-specific prior-preservation loss (PPL) is introduced, as in Equation 2.

Equation 2. Prior-preservation loss (PPL)

x′ is an image generated by the pre-trained Stable Diffusion from a text prompt that does not contain [identifier] (e.g., "a photo of a [class]"). The overall loss function combines Equations 1 and 2, where λ is a hyperparameter.

Equation 3. DreamBooth loss function
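A minimal sketch of how Equations 1 and 2 combine into Equation 3; the variable names are ours, and in the real pipeline both MSE terms are computed by the U-Net on the reference and prior images, respectively:

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def dreambooth_loss(eps, eps_pred, eps_pr, eps_pr_pred, lam=1.0):
    """Eq. 3: instance loss (Eq. 1) + lambda * prior-preservation loss (Eq. 2)."""
    return mse(eps, eps_pred) + lam * mse(eps_pr, eps_pr_pred)

rng = np.random.default_rng(2)
e = rng.standard_normal(16)
total = dreambooth_loss(e, e * 0.9, e, e * 0.8, lam=1.0)
```

λ trades off fidelity to the reference images against preservation of the model's prior knowledge of the class.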

Spurious feature similarity loss

Spurious feature similarity loss (SFSL) is also proposed to facilitate the generation of spurious features.

As shown in Figure 2, a pre-trained model is used to estimate spurious features from the reference and generated images; the paper uses the trained model released with Spurious ImageNet. The spurious features are computed from the class k, the input image x, and the final-layer features ϕ(x) of that model using the following equation.

Equation 4. Calculation of spurious features
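The exact form of Equation 4 is given in the paper; the following is only a plausible sketch, assuming the class-k spurious feature vector is the componentwise contribution of the final-layer features ϕ(x) to the class-k logit (last-layer weights times features):

```python
import numpy as np

def spurious_features(phi_x, W, k):
    """Assumed reading of Eq. 4: componentwise contribution of phi(x)
    to class k's logit, using the classifier's last-layer weights W."""
    return W[k] * phi_x  # shape: (num_features,)

rng = np.random.default_rng(3)
phi_x = rng.random(512)     # final-layer features phi(x)
W = rng.random((100, 512))  # hypothetical last-layer weights, 100 classes
f_k = spurious_features(phi_x, W, k=3)
```

Under this reading, summing the components recovers the class-k logit (ignoring any bias term).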

The spurious feature similarity loss (SFSL) is computed from the cosine similarity S_C between the spurious features of the reference image and those of the generated image.

Equation 5. Spurious feature similarity loss (SFSL)
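A sketch of Equation 5, assuming the loss takes the common form of one minus the cosine similarity, so that identical spurious features give zero loss:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sfsl(f_ref, f_gen):
    """Eq. 5 (assumed form): 1 - S_C between the spurious features of the
    reference image and those of the generated image."""
    return 1.0 - cosine_similarity(f_ref, f_gen)

f_ref = np.array([1.0, 0.0, 2.0])
f_gen = np.array([2.0, 0.0, 4.0])  # same direction as f_ref -> near-zero loss
```

Minimizing this loss pushes the generated image's spurious features toward those of the reference spurious images.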

This loss is added to Equation 3, weighted by the hyperparameter κ, to obtain the final loss function of the proposed method, as in Equation 6.

Equation 6. Final loss function of the proposed method
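Putting the pieces together, Equation 6 is simply the DreamBooth loss plus the κ-weighted SFSL; a trivial sketch:

```python
def final_loss(l_dreambooth, l_sfsl, kappa=0.1):
    """Eq. 6: Eq. 3 plus kappa * Eq. 5; kappa is tuned per target class."""
    return l_dreambooth + kappa * l_sfsl

total = final_loss(0.5, 0.2, kappa=0.1)
```

As the experiments later show, the best κ differs by class, so it is treated as a per-class hyperparameter.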

Experiment

Datasets and Classifiers

For our experiments, we used the Spurious ImageNet dataset. It contains 100 classes.

Each class has 75 spurious images with a resolution of 367 x 367, for a total of 7,500 images. As noted above, not all images in Spurious ImageNet are consistently spurious across different classifiers.

Therefore, for each test class we selected six images that are spurious for all of the following four classifiers: ResNet-50 (PyTorch weights V1 and V2) [He et al., 2016], robust ResNet-50 [Croce et al., 2022], and ViT-B/16 [Steiner et al., 2022].

Spurious accuracy

We sampled 75 randomly selected generated images from each test class and measured their spurious accuracy, compared with Spurious ImageNet, on the four classifiers: ResNet-50 V1 and V2, robust ResNet-50, and ViT-B/16.
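As we read it, spurious accuracy is the fraction of images that contain only the spurious feature (not the class object) yet are still assigned to the target class by a classifier. A hedged sketch of the metric:

```python
import numpy as np

def spurious_accuracy(predicted_labels, target_class):
    """Fraction of spurious-only images that the classifier nevertheless
    labels as the target class (our reading of the metric, not the authors' code)."""
    return float(np.mean(np.asarray(predicted_labels) == target_class))

# e.g. 3 of 4 flower-only images predicted as class 3 (the "hummingbird" class)
acc = spurious_accuracy([3, 3, 5, 3], target_class=3)
```

A higher value means the classifier is more strongly fooled by the spurious feature.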

Table 1 summarizes the spurious accuracy results, where SI denotes Spurious ImageNet. For all test classes except "flagpole", the generated images exhibited more spurious features across the different classifiers. This indicates that the proposed method complements Spurious ImageNet when evaluating the spurious performance of existing ImageNet classifiers.

The proposed method can be used to create a more robust spurious test data set.

Table 1: Spurious accuracy (%) of the generated images and Spurious ImageNet (SI) of the proposed method.

This paper is the first attempt to utilize a large-scale Text-to-Image diffusion model to generate spurious images. As such, it is not directly comparable to other methods.

However, since the method builds on DreamBooth, a comparison with DreamBooth is made here. Table 2 compares the average spurious accuracy of the six classes across the four classifiers. Jointly fine-tuning the text encoder on top of DreamBooth already yields images with more spurious features.

The addition of the proposed spurious feature similarity loss (SFSL) further improved the spurious accuracy. It was observed that the hyperparameter κ has different effects on different classes. Therefore, we find it necessary to adjust the κ value based on the target class.

Table 2. Comparison with DreamBooth

Perceptual quality

TOPIQ [Chen et al., 2023], a state-of-the-art perceptual image quality assessment metric, was used to objectively measure the perceptual quality of the generated spurious images.

Table 3 summarizes the objective evaluation results, where TOPIQ scores were calculated for the 6 images (all training images) and the 75 generated images for each class. Scores for the generated images were close to those of the real images for the "hummingbird" and "koala" classes, but lower for the other classes.

To further evaluate the quality of the spurious-generated images, a subjective evaluation is performed in the next subsection.

Table 3. Average TOPIQ scores for the images

Subjective evaluation

Ten users (researchers, students, and non-technical users) took part in a subjective evaluation. Each was shown ten random images (a mixture of real and generated ones) from each class and asked to rate each on a scale of 1 to 5 for naturalness.

Figure 3 summarizes the subjective rating results. On average, 46.33% of ratings for the real images were the highest score of 5 (very natural), versus 20% for the generated images. This shows that some generated images are natural and realistic.

Figure 3: Subjective evaluation of real and generated images

The generated images for all six classes were also checked manually, and diffusion artifacts were observed in some of them. Figure 4 shows selected generated images alongside Spurious ImageNet images. However, there is no limit on how many images the generative model can produce, so many images can be sampled under different settings until a satisfactory one is obtained.

Figure 4. Examples of generated images (second row) and Spurious ImageNet images (first row). Red labels are predicted classes; black labels are true subjects.

Conclusion

This paper shows that, given a few spurious images from Spurious ImageNet, Stable Diffusion can be fine-tuned with the newly proposed spurious feature similarity loss to generate new spurious images.

The proposed method saves the time of filtering through many images to find spurious features, and thus complements Spurious ImageNet in preparing spurious feature test datasets. Experiments confirm that the generated images are spurious across different classifiers and visually similar to the Spurious ImageNet images.
