Successful Cross-domain GAN Learning With Only 10 Images

GAN (Hostile Generation Network) 12/07/2021

3 main points
✔️ Successful cross-domain learning of GANs with only 10 cards
✔️ Learned diversity as differences between features
✔️ Demonstrated overwhelmingly high accuracy

Few-shot Image Generation via Cross-domain Correspondence
written by Utkarsh Ojha, Yijun Li, Jingwan Lu, Alexei A. Efros, Yong Jae Lee, Eli Shechtman, Richard Zhang
(Submitted on 13 Apr 2021)
Comments: Accepted by CVPR 2021.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)

code：

first of all

Training a GAN generally requires a large number of training images, and training on a small number of data often results in overfitting. However, in this paper, we change the way of thinking a little bit and introduce our research that we pre-train a large dataset as a source domain and let GAN generate even only 10 images by considering about 10 images as a generation to the target.

In the figure below, we can see that $G_s$ trained on FFHQ, which has enough training data, is able to generate face images. However, when we fine-tuned it with a small number of paintings data (we think that these 10 images were actually used for training), we can see that it is overfitting as expected (Overfit $G_s→t$). However, in the paper presented here, we can see that we can learn without overfitting while maintaining the diversity of the source domain with only 10 training cards (Our $G_s→t$ ). We believe that this paper was accepted for CVPR2021 because it changes the way we think and shows a fairly simple but promising result.

proposed method

The key to this method is how to ensure diversity without overfitting it with a few data!

Ensuring diversity

The idea is pretty simple: by adding a regularization that maintains the feature differences between source domain images to the domain destination, the target domain can also inherit the feature differences (diversity) of the source domain to ensure diversity. It is the following formula. I have added comments only on the source side, but it is the same for the target side.

Using KL divergence from the above two, we are trying to make the target have the same distribution as the source (equation below), which is similar in idea to Contrastive learning!

Prevent Overfit

The reason why it is easy to overfit in the first place is, as you may intuitively understand If the data set is small, the distribution that can be learned is also small, so it can be handled just by memorizing. The interesting thing from this is that with a small number of data The definition of what constitutes a "realistic" sample becomes even denser. So we focus on the fact that a few training images only form a small subset of the desired distribution. For example, if the ideal overall latent variable can be learned from 100 images, then 10 images can learn a tenth of the ideal latent variable (subset), although not exactly. We now define an anch domain $Z_{anch}⊂Z$ that forms a subset of the whole latent space which forms a subset of the whole latent space. When sampled from these regions, we use the full image discriminator $D_{img}$. Then, by defining $D_{patch}$ as a subset of the larger $D_{img}$ network For a patch of images, the discriminator $D_{patch}$ ( using valid patch sizes ranging from 22 × 22 to 61 × 61 ). By doing so we can't simply memorize the whole image. We have to look at the details.

experiment

The following models are used for comparison.

Transferring GANs (TGAN)
Batch Statistics Adaptation (BSA)
MineGAN
Freeze-D
Non-leaking data augmentations
EWC

For the dataset, we will use the following. The source will be the pre-training domain and the target will be the image domain we want to generate.

source

Flickr-Faces-HQ (FFHQ)
LSUN Church
LSUN Cars
LSUN Horses

Target

face caricatures
face sketches
face paintings by Amedeo Modigliani
FFHQ-babies
FFHQsunglasses
landscape drawings
haunted houses
Van Gogh's house paintings
wrecked/abandoned cars

result

The results clearly show that the proposed method is working. It is obvious when comparing the hat and face orientation.

Results for similar domains

I made $G_s(Z)$ like Real samples. If you look at $G_t(Z)$, it works well too. In particular, the sunglasses domain on the far right is easy to understand. However, we have found some issues here. In the sunglasses domain, when you wear sunglasses, your hair becomes darker. This is not what is expected. There are women with blonde hair in the source, so the difference between those features is not working well.

Results for different domains

We are experimenting with this different domain with the idea that since we are combining the differences between features in the source domain with the differences between features in the domain, we can see trends that can be tied to each feature in the change.

The results for Church are easy to understand, but it seems that the window features of Church (source) and the eye features of Caricatures (domain) are, somehow, paired. It seems to me that we are learning as the authors suggest.

quantitative evaluation

The results of the quantitative evaluation of each result are shown in the table.

FID score (top) and LPIPS distance (bottom). Both show the highest accuracy.

summary

It is surprising that the generation works well with only, 10 images. I think it is highly appreciated because it learns as the author thought. It implements a fairly intuitive idea to ensure diversity as the difference between features, and to put a learning trick in the Discriminator to prevent overfitting, which is not easy to learn, and it works well.

The authors have only implemented the idea this time, and there is still further development that can be expected by considering a representation method that can guarantee diversity. In the future, there may be a more efficient way to utilize GANs learned from large-scale data by using this representation method.

For example, if we take the idea of GAN and minority data... It is just an idea, but we may be able to create a large number of lung nodule images by training $G_s$, which is trained with normal X-ray images, with X-ray images showing a small number of lung nodules. Of course, the properties of natural images and medical images are different, but I considered that the difference in properties can be solved by devising a way of expression.