Restoring Faces Using GANs: See How These Photos Of 20th Century Scientists Come To Life.
3 main points
✔️ A new superior model for blind face restoration
✔️ Significantly outperforms all existing models
✔️ Rated higher than any other model by human reviewers
GAN Prior Embedded Network for Blind Face Restoration in the Wild
written by Tao Yang, Peiran Ren, Xuansong Xie, Lei Zhang
(Submitted on 13 May 2021)
Comments: Accepted by CVPR2021
Subjects: Computer Vision and Pattern Recognition (cs.CV)
code:
Introduction
Much progress has been made in the field of image restoration, yet blind face restoration (BFR) remains a challenge. BFR is more complex because current models are unable to generalize well to the varied degradations a low-quality (LQ) image undergoes, which are unknown during training. Some models restore images quite well with respect to artificial distortions but fail otherwise. Other models achieve more realistic results, but they tend to over-smooth the faces.
In this paper, we introduce a new method for restoring facial images in the wild (i.e., images that have undergone the complicated distortions of real life). Specifically, we embed a Generative Adversarial Network (GAN) pre-trained for high-quality (HQ) face image generation into a Deep Neural Network (DNN), where it serves as the decoder. Our model sets a new state of the art for BFR and is able to restore severely damaged images.
GAN Prior Embedded Network (GPEN)
Let us denote the space of low-quality (LQ) images by X, and the space of original high-quality (HQ) images by Y. The task in BFR is to correctly map an input LQ image x ∈ X to its corresponding original HQ image y ∈ Y. Current methods aim to train a DNN as a mapping function from X to Y. The problem with this approach is that BFR is a one-to-many problem: there are many plausible face images (y1, y2, y3, ...) for a particular x. Because these DNNs are trained with a pixel-to-pixel loss against the target, the final solution y = DNN(x) tends to be a mean of the possible target faces. This makes the generated faces over-smoothed and lacking in detail.
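To see why this happens, consider a toy illustration (hypothetical code, not from the paper): if a network must produce a single output that minimizes a pixel-wise L2 loss against several equally plausible target faces, the optimal output is simply their pixel-wise average, which washes out all high-frequency detail.

```python
import numpy as np

# Three equally plausible "target" faces for the same LQ input,
# represented here as tiny 2x2 grayscale patches for illustration.
targets = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.8, 0.2]],
    [[0.5, 0.5], [0.5, 0.5]],
])

# The single output y that minimizes the average L2 loss
# sum_i ||y - y_i||^2 is the pixel-wise mean of the targets.
y_opt = targets.mean(axis=0)
print(y_opt)  # every pixel collapses toward 0.5: a blurred, detail-free "average face"
```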
To solve these issues, we first train a GAN prior network and then embed it into a DNN, where it acts as the decoder that generates HQ images. As shown in the figure above, the image is first passed through a CNN, which maps it to a latent code z in the latent space Z. This latent code z is then passed through the GAN to produce the HQ image. Unlike previous methods, the GAN now performs a one-to-one mapping, i.e., it maps the latent code to an HQ image. Note, however, that this means GPEN cannot generate multiple different HQ images from a single LQ image. Further architecture details are discussed next.
Network Architecture
Our model GPEN has a U-Net-like architecture (c). The first half consists of a DNN (the encoder), and the second half consists of a GAN (the decoder). As in a U-Net, the feature map from each block in the first half is fed into the corresponding GAN block in the second half. Before the two halves are combined, the GAN is pre-trained separately to generate HQ facial images; the combined network is then fine-tuned for BFR.
The GAN (a) consists of several GAN blocks (b), which can be chosen from any of the popular GANs: BigGAN, StyleGAN, PGGAN. We use the GAN block from StyleGAN v2 because it is better at generating HQ images. Just as in StyleGAN, the latent vector z obtained from the DNN is first transformed into a less entangled space W, and the transformed vector w is broadcast to each GAN block. While training the GAN alone, noise is also broadcast to each GAN block, where it is concatenated with the feature maps. In the combined model, this noise is replaced by the corresponding feature map from the DNN, and the latent vector z is given by the output of the DNN. For more details on the GAN blocks, check this paper.
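The sketch below is a simplified, hypothetical PyTorch rendering of this U-shaped design (module names, channel counts, and image sizes are illustrative, not the authors' code): the encoder produces the latent code z and multi-scale feature maps, z is mapped to w, and each decoder block receives w together with the matching encoder feature map in place of the noise input. Note that the real StyleGAN2 blocks use w to modulate convolution weights; here w is simply concatenated to keep the sketch short.

```python
import torch
import torch.nn as nn

class ToyGPEN(nn.Module):
    """A highly simplified GPEN-style U-shaped network (illustrative only)."""

    def __init__(self, channels=(32, 64, 128), latent_dim=128):
        super().__init__()
        # First half: a CNN encoder that downsamples the LQ image.
        self.enc_blocks = nn.ModuleList()
        in_ch = 3
        for ch in channels:
            self.enc_blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, ch, 3, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            ))
            in_ch = ch
        self.to_latent = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels[-1], latent_dim))
        # Mapping network: latent code z -> less entangled code w (as in StyleGAN).
        self.mapping = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2),
            nn.Linear(latent_dim, latent_dim))
        # Second half: GAN-style decoder blocks. Each block consumes the previous
        # features, the matching encoder feature map (replacing the noise input),
        # and the broadcast code w, then upsamples.
        self.dec_blocks = nn.ModuleList()
        rev = list(reversed(channels))
        for i, ch in enumerate(rev):
            prev_ch = rev[i - 1] if i > 0 else 0
            self.dec_blocks.append(nn.Sequential(
                nn.Conv2d(prev_ch + ch + latent_dim, ch, 3, padding=1),
                nn.LeakyReLU(0.2),
                nn.Upsample(scale_factor=2, mode="nearest"),
            ))
        self.to_rgb = nn.Conv2d(channels[0], 3, 3, padding=1)

    def forward(self, x):
        feats, h = [], x
        for blk in self.enc_blocks:
            h = blk(h)
            feats.append(h)               # feature maps passed to the decoder
        z = self.to_latent(h)             # latent code produced by the encoder
        w = self.mapping(z)               # broadcast to every decoder block
        out = None
        for blk, skip in zip(self.dec_blocks, reversed(feats)):
            w_map = w[:, :, None, None].expand(-1, -1, *skip.shape[2:])
            inp = skip if out is None else torch.cat([out, skip], dim=1)
            out = blk(torch.cat([inp, w_map], dim=1))
        return torch.tanh(self.to_rgb(out))

# Example: restore a batch of 64x64 LQ images (sizes are illustrative).
restored = ToyGPEN()(torch.randn(2, 3, 64, 64))   # -> (2, 3, 64, 64)
```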
Training
The GAN is first trained independently with settings similar to StyleGAN. It is then embedded into the DNN and fine-tuned with three loss functions: the adversarial loss LA, the content loss LC, and the feature matching loss LF.
In these losses, D is the discriminator model, G is the generator (GPEN), X' is the LQ image, and X is the ground-truth HQ image. LC is the L1-norm between the ground-truth image and the generated image. LF is the sum of L2 norms between the discriminator's intermediate feature maps for the generated image and the ground-truth image, where T is the number of intermediate layers of the discriminator. The combined loss is as follows.
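Written out from the definitions above (the notation is a reconstruction and may differ slightly from the paper's), the content loss, feature matching loss, and combined objective are:

```latex
L_C = \lVert G(X') - X \rVert_1, \qquad
L_F = \sum_{i=0}^{T} \lVert D_i(X) - D_i(G(X')) \rVert_2, \qquad
L = L_A + \alpha L_C + \beta L_F
```

where D_i(·) denotes the feature map produced by the i-th intermediate layer of the discriminator.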
In all our experiments, we set α = 1 and β = 0.02. The feature matching loss helps balance the adversarial loss and recover more realistic, detailed images.
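As a rough PyTorch sketch of this fine-tuning objective (hypothetical code: the discriminator interface `return_feats=True` and the non-saturating form of the adversarial term are assumptions, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def generator_loss(generator, discriminator, lq, hq, alpha=1.0, beta=0.02):
    """Adversarial + content + feature-matching loss for fine-tuning the generator.

    `discriminator(img, return_feats=True)` is assumed to return the final logit
    along with a list of intermediate feature maps, one per layer.
    """
    restored = generator(lq)

    # Content loss L_C: L1 distance to the ground-truth HQ image.
    l_content = F.l1_loss(restored, hq)

    # Feature matching loss L_F: L2 distance between discriminator feature maps.
    logit_fake, feats_fake = discriminator(restored, return_feats=True)
    _, feats_real = discriminator(hq, return_feats=True)
    l_feat = sum(torch.norm(ff - fr.detach()) for ff, fr in zip(feats_fake, feats_real))

    # Adversarial loss L_A (non-saturating form, used here only as a stand-in).
    l_adv = F.softplus(-logit_fake).mean()

    return l_adv + alpha * l_content + beta * l_feat
```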
Experiments
We trained our model on the FFHQ dataset, which contains 70,000 HQ images of 1024x1024 resolution. The same dataset is used to train the GAN prior network and to fine-tune the combined network. For fine-tuning, the LQ images are synthesized from the FFHQ dataset: the HQ images are randomly blurred, downsampled, corrupted with Gaussian noise, and JPEG-compressed. Mathematically, the degradation can be described by the following model:
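Using the symbols defined just below, this degradation model can be written as:

```latex
I_d = \big( (I \otimes k) \downarrow_s +\, n_\sigma \big)_{\mathrm{JPEG}_q}
```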
Here, I, k, n_σ, and I_d are the input face image, the blur kernel, Gaussian noise with standard deviation σ, and the degraded image, respectively. Similarly, ⊗, ↓_s, and JPEG_q denote two-dimensional convolution, the standard s-fold downsampler, and JPEG compression with quality factor q. All three components (the encoder, the decoder, and the discriminator) are trained with the Adam optimizer using different learning rates: lr_enc = 0.002, with lr_enc : lr_dec : lr_dis = 100 : 10 : 1.
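For illustration, a minimal Python sketch of such a synthesis pipeline (using OpenCV and NumPy; the blur strength, scale factor, noise level, and JPEG quality below are illustrative values, not the paper's exact training settings):

```python
import cv2
import numpy as np

def degrade(hq, blur_sigma=3.0, scale=4, noise_sigma=10.0, jpeg_quality=30):
    """Synthesize an LQ image from an HQ one: blur -> downsample -> noise -> JPEG."""
    h, w = hq.shape[:2]

    # 1. Blur with a Gaussian kernel (a stand-in for the blur kernel k).
    blurred = cv2.GaussianBlur(hq, (0, 0), blur_sigma)

    # 2. s-fold downsampling.
    small = cv2.resize(blurred, (w // scale, h // scale), interpolation=cv2.INTER_LINEAR)

    # 3. Additive Gaussian noise with standard deviation sigma.
    noisy = small.astype(np.float32) + np.random.normal(0, noise_sigma, small.shape)
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)

    # 4. JPEG compression with quality factor q.
    ok, buf = cv2.imencode(".jpg", noisy, [int(cv2.IMWRITE_JPEG_QUALITY), jpeg_quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)
```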
Comparison of Variants of GPEN
In order to investigate the importance of the components of GPEN, we tested several variants of GPEN on BFR. GPEN-w/o-ft is a variant where the embedded GAN is not fine-tuned. GPEN-w/o-noise is a variant where no noise inputs are added to the GAN blocks during training. GPEN-noise-add is a variant where the noise inputs to the GAN blocks are added rather than concatenated.
The above table shows the PSNR, FID, and LPIPS scores of all these variants on the FFHQ dataset. It is clear that the full GPEN model is superior to its variants.
Comparison with other GANs
Many facial restoration GANs have been designed for the task of facial-image super-resolution (FSR): generating a high-resolution (HR) image from a low-resolution (LR) image. Therefore, we compare GPEN with other state-of-the-art GANs on FSR, synthetic BFR, and BFR in the wild.
The above table shows the results on FSR. We compared our model, designed for BFR, with models designed specifically for FSR. The LR images were generated from the CelebA-HQ dataset. The bilinear upsampling baseline, which adds no detail to the image, has the best PSNR score, which shows that PSNR is not a good metric for FSR. Remarkably, GPEN outperforms all other models on the FID and LPIPS metrics.
The above table shows the results for BFR on LQ images synthesized from the CelebA-HQ dataset. As with FSR, GPEN outperforms the other models on the FID and LPIPS metrics by a large margin.
To highlight the practical significance of GPEN, we collected 1,000 LQ face images from the internet, reconstructed them with GPEN and other SOTA models, and asked volunteers to rate the quality of the reconstructed images. The results show that the perceptual quality of images reconstructed by GPEN is much higher than that of the other SOTA methods. Let us take a look at a few sample images.
As we can see, the other methods either tend to over-smooth the images or fail to add any visual detail to them.
In the future, we will extend GPEN to allow multiple HQ outputs for a given LQ image. For example, we can use an extra HQ face image as a reference so that different HQ outputs can be generated by GPEN for different reference images.
Summary
As we have seen, current SOTA models fail to generalize well to real-world deteriorated images. Our model overcomes that difficulty through its carefully designed GAN prior and fine-tuning strategy, and GPEN has direct practical applications. This work can be extended to other tasks such as face colorization, face inpainting, and restoring non-face images, and GPEN could be extended to generate multiple HQ outputs for a given LQ image.