# New GAN That Reflects "Meaning" in Image Generation

3 main points
✔️ Instance-Conditioned GAN (ICGAN) is developed, which conditions data generation on the features of individual data points (instances) in order to handle complex data distributions consisting of many modes.
✔️ ICGAN is shown to enable semantic manipulation of generated data by conditioning on both class labels and instance features.
✔️ ICGAN is confirmed to retain high generative performance even on datasets containing data that is not similar to its training data.

written by Arantxa Casanova, Marlène Careil, Jakob Verbeek, Michal Drozdzal, Adriana Romero-Soriano
(Accepted at NeurIPS 2021)

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

code:

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Introduction

Generative Adversarial Networks (GANs) are deep learning models consisting of a discriminator (D), which distinguishes real data from generated data, and a generator (G), which generates data from noise; the two networks are trained by competing against each other.

One of the currently known issues with GANs is mode collapse: the generated data becomes biased toward a subset of the training data, compromising both the quality and the diversity of the generated samples.

Mode collapse is known to make it difficult to train GANs on datasets such as ImageNet, which contain a large number of object classes. To cope with such a large number of modes, conditional GANs have been proposed, which improve sample quality and diversity by conditioning data generation on class labels.

Because conditional GANs require a large amount of labeled data, recent work has attempted to learn distributions with many modes without using labels. The main approach in such research is to obtain the class labels needed for conditioning by unsupervised learning, typically clustering. In this case, the granularity with which the data is partitioned into clusters strongly affects generative performance.

Instance-Conditioned GAN (ICGAN), introduced in this paper, addresses this problem by conditioning the discriminator and generator on features of individual data points (instances) and by presenting the neighborhoods of those data points to the discriminator as real data. We will look at the details of ICGAN in the next section.

## Overview of Instance-Conditioned GANs

The model overview of Instance-Conditioned GAN (ICGAN) is shown in the figure below.

The input image is mapped to the feature space by the feature extractor $f_\phi$, and the resulting feature vector $\mathbf{h}$ is fed to both the generator and the discriminator. The generator produces an image $\mathbf{x}_g$ from the feature $\mathbf{h}$ and sampled noise $\mathbf{z}$. The discriminator, conditioned on the same feature $\mathbf{h}$, distinguishes the generated image $\mathbf{x}_g$ from a neighboring image $\mathbf{x}_n$. A pretrained ResNet-50 is used as the feature extractor $f_\phi$, and its parameters are kept frozen during ICGAN training.

In ICGAN, neighboring images of the input image are also given to the discriminator as real (positive) examples.
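Concretely, such a neighborhood can be built with a nearest-neighbor search in the feature space (the paper uses the nearest neighbors under cosine similarity over the extracted features). A minimal numpy sketch, with random vectors standing in for ResNet-50 features and a toy `k` rather than the paper's setting:

```python
import numpy as np

def nearest_neighbors(features, query_idx, k):
    """Indices of the k nearest neighbors of one instance in feature
    space, ranked by cosine similarity (the query itself is excluded)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f[query_idx]
    sims[query_idx] = -np.inf          # never return the query itself
    return np.argsort(sims)[::-1][:k]

# Toy stand-in for ResNet-50 features: 100 instances, 16-d vectors
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
nbrs = nearest_neighbors(feats, query_idx=0, k=5)
```

During training, one instance is sampled, and its neighbor set supplies the "real" images the discriminator sees for that conditioning feature.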

The neural network is trained by the minimax game shown below.
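For reference, the instance-conditioned objective can be sketched as follows, consistent with the model description above ($f_\phi$ is the frozen feature extractor and $\mathcal{A}_i$ the neighborhood of instance $\mathbf{x}_i$; the paper's exact formulation may differ in details such as term weighting):

$$
\min_G \max_D \;
\mathbb{E}_{\mathbf{x}_i \sim p(\mathbf{x}),\, \mathbf{x}_n \sim \mathcal{A}_i}
\big[\log D\big(\mathbf{x}_n, f_\phi(\mathbf{x}_i)\big)\big]
+ \mathbb{E}_{\mathbf{x}_i \sim p(\mathbf{x}),\, \mathbf{z} \sim p(\mathbf{z})}
\big[\log\big(1 - D\big(G(\mathbf{z}, f_\phi(\mathbf{x}_i)),\, f_\phi(\mathbf{x}_i)\big)\big)\big]
$$

This is the standard GAN minimax game, except that both networks receive the instance feature $f_\phi(\mathbf{x}_i)$ and the "real" examples are drawn from the neighborhood of that instance.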

ICGAN learns a cluster distribution around each data point, with the data point acting as its representative. The overall data distribution is then represented as an overlap of these cluster distributions, which has the advantage of avoiding the imbalance in cluster sizes that arises when the data is partitioned with clustering methods.
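Schematically, this is a kernel-density-style mixture: with $M$ training instances and $\mathbf{h}_i = f_\phi(\mathbf{x}_i)$, the modeled distribution is (our notation, not the paper's exact equation):

$$
p(\mathbf{x}) \approx \frac{1}{M} \sum_{i=1}^{M} p(\mathbf{x} \mid \mathbf{h}_i)
$$

Every instance contributes one equally weighted component, so no single component can dominate the way a large cluster does under hard partitioning.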

Another strength of ICGAN is that it admits a semantic interpretation, such as generating similar samples from similar data points, because ICGAN is conditioned on features rather than on cluster indices.

## Experiments on image generation using ICGAN

We have experimented with ICGAN-based image generation under two conditions: without labels and with labels.

Without labels, we evaluated the generated images on two complex datasets, ImageNet and COCO-Stuff.

We also evaluated the generated images on ImageNet-LT, a long-tailed variant of ImageNet in which many classes have only a small number of samples. In addition, in the labeled setting we conducted an experiment in which class labels were swapped at generation time, to confirm whether semantic control of the generated images is possible.

The results presented in this article are those for image generation on ImageNet, with and without labels.

## Evaluation metrics for generated images

The images generated by ICGAN are evaluated with Fréchet Inception Distance (FID) and Inception Score (IS), which are standard evaluation metrics for GANs.
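FID compares the Gaussian statistics (mean and covariance) of network activations for real and generated images. A self-contained numpy sketch on stand-in activations (real FID uses Inception-v3 pooling features; the eigenvalue trick below for $\mathrm{Tr}\big((\Sigma_1\Sigma_2)^{1/2}\big)$ is valid for positive semi-definite covariances):

```python
import numpy as np

def fid(act1, act2):
    """Frechet Inception Distance between two activation sets
    (rows = samples): ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^(1/2))."""
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    s1 = np.cov(act1, rowvar=False)
    s2 = np.cov(act2, rowvar=False)
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) via the eigenvalues of S1 @ S2, which are
    # real and non-negative for PSD covariances (up to numerical noise)
    eig = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
fake = rng.normal(size=(500, 8)) + 1.0   # shifted distribution
print(fid(real, real))   # ~0: identical statistics
print(fid(real, fake))   # larger: the mean shift is penalized
```

Lower FID means the generated distribution's statistics are closer to the real data's; IS instead scores a classifier's confidence and the variety of predicted classes.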

In addition, we evaluate the diversity of the generated images using an image-specific metric called Learned Perceptual Image Patch Similarity (LPIPS), which feeds images to AlexNet and computes the distance between their activations.
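Diversity can then be scored by averaging pairwise distances between the activations of generated samples (higher means more diverse). A schematic numpy stand-in (real LPIPS uses unit-normalized AlexNet feature maps with learned layer weights, not raw vectors):

```python
import numpy as np

def pairwise_diversity(acts):
    """Mean pairwise L2 distance between activation vectors --
    a simplified stand-in for LPIPS-based diversity scoring."""
    n = len(acts)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += float(np.linalg.norm(acts[i] - acts[j]))
    return total / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
varied = rng.normal(size=(10, 32))         # diverse "activations"
collapsed = np.tile(varied[0], (10, 1))    # mode collapse: all identical
```

A mode-collapsed generator produces near-identical samples and therefore a near-zero score under such a metric.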

## Results of image generation experiments

First, the results of image generation in ImageNet under unlabeled conditions are shown in the table below.

We can see that ICGAN outperforms conventional methods at 64x64, 128x128, and 256x256 resolutions. The baseline BigGAN remains inferior to ICGAN even when its number of parameters is increased to match ICGAN's capacity (♰ in the table).

In addition, the improvement in generative performance when horizontal flipping is used as data augmentation (DA(d) in the table: augmentation of the discriminator's positive and negative examples; DA(i): augmentation of the input instance image) indicates the effectiveness of data augmentation in ICGAN.

Next, the results of image generation in ImageNet under the condition with labels are as follows.

Generative performance on par with the baseline BigGAN is observed at most resolutions.

The images below show generation results for the unlabeled case (left half of the figure) and the labeled case (right half of the figure).

In each condition, the leftmost column is the input image used to compute the features, and the three columns to its right are images generated from noise. In the labeled case, a class label (e.g. golden retriever) is given separately to generate images of a class different from that of the input image.

The examples of adding a golden retriever to a grass background and a camel to a snow background show that ICGAN is capable of semantic manipulation: the generated image combines the background of the input image with the content of the given class label.
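Schematically, this semantic transfer holds the instance feature fixed while swapping the class label at generation time (pseudocode; the function names are illustrative, not the paper's API):

```
h = f_phi(snowy_background_image)   # instance feature: scene / layout
z = sample_noise()
x = G(z, h, label="camel")          # class content comes from the label
```

The feature $\mathbf{h}$ carries the "where" (background, layout), while the swapped label supplies the "what".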

Interestingly, when ICGANs trained on ImageNet were used to generate images on COCO-Stuff, they showed better generation performance than ICGANs trained on COCO-Stuff itself. We speculate that this is due to the effectiveness of the pretrained feature extractor and generator.

## Conclusion

I think ICGAN's two major contributions are that it addresses mode collapse by assuming a mixture distribution of clusters centered on each data point, and that it uses a pretrained feature extractor to condition on the "meaning" of the image rather than on cluster indices.

On the other hand, the disadvantages of ICGAN are that it must keep a large number of data points to cover complex data distributions, and that its generative performance depends on the feature extractor.

Surprisingly, we were able to construct a GAN that shows generalization performance even on datasets with different data distributions. It will be interesting to see if the GAN also works well for generating images of object classes that did not appear in the training of the feature extractor.

If you're interested, check out the results generated by COCO-Stuff and the generalization performance on other datasets in the original paper!
