New GAN With Improved Editing Performance!
3 main points
✔️ GAN that can be edited per semantic region
✔️ Propose a learning framework that can divide the latent space by semantic region using semantic masks
✔️ Combine with existing image editing methods for more detailed editing
SemanticStyleGAN: Learning Compositional Generative Priors for Controllable Image Synthesis and Editing
written by Yichun Shi, Xiao Yang, Yangyue Wan, Xiaohui Shen
(Submitted on 4 Dec 2021 (v1), last revised 29 Mar 2022 (this version, v3))
Comments: Camera-ready for CVPR 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
first of all
StyleGAN can not only generate high-quality images but also edit the style of an image through latent variables that correspond to coarse-to-fine features. However, the meaning of each latent variable in StyleGAN is relatively ambiguous, and because different attributes are entangled with one another, operations such as attribute manipulation often end up editing unexpected attributes or regions.
To address this problem, several works have proposed new GANs, but they target the manipulation of global attributes, and none of them allows local manipulation.
Local manipulation means, for example, editing only the eyes or the hair in a face image. With existing models, such an operation could affect other parts of the image even when you intended to edit only the eyes.
The SemanticStyleGAN introduced in this article uses semantic masks to partition the latent space into semantic regions, enabling local manipulation.
The image above shows the result of editing a face image using SemanticStyleGAN.
The top row shows the edited semantic region and the reference image; you can see that only the specified semantic region has been edited well, while the other regions are not significantly affected.
proposed method
Two challenges have been identified for training a StyleGAN whose latent space is separated by region.
- How do we separate the different regions?
- How do we give each region a semantic meaning?
For the first, a local generator is used for each region, and their outputs are combined to obtain the final image. For the second, both the RGB image and the semantic mask are given to the discriminator as input, and their joint distribution is learned.
The following diagram gives an overview of the training framework.
We now describe the local generator g_k, the Fusion module, the rendering network R, and the training procedure in turn.
local generator
The image above shows the structure of the local generator.
The inputs are coordinate-encoded Fourier features and a latent variable w, and the outputs are a pseudo-depth value d_k and a feature f_k.
The latent variable w is divided into w_base, w_k^s, and w_k^t. Feeding these separately as input makes it possible to learn the coarse layout, the structure, and the texture separately, so that each element can be manipulated at inference time.
However, the local generator processes every pixel, so using 256x256 Fourier features as-is would be computationally expensive. To balance performance and computational cost, the input size is reduced and the Fourier features are set to 64x64.
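To make the structure more concrete, here is a minimal sketch (PyTorch) of such a per-region local generator: a stack of modulated 1x1 convolutions applied to Fourier features, where w_base modulates the shared early layers, w_k^s the structure layers, and w_k^t the texture layers. The layer counts, channel sizes, and class names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedMLP(nn.Module):
    """1x1 modulated convolution: the latent w scales the input channels."""
    def __init__(self, in_ch, out_ch, w_dim=512):
        super().__init__()
        self.affine = nn.Linear(w_dim, in_ch)  # maps w to per-channel scales
        self.conv = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x, w):
        scale = self.affine(w).unsqueeze(-1).unsqueeze(-1)
        return F.leaky_relu(self.conv(x * scale), 0.2)

class LocalGenerator(nn.Module):
    """g_k: Fourier features + (w_base, w_s, w_t) -> (feature f_k, pseudo-depth d_k)."""
    def __init__(self, fourier_ch=64, hidden=64, feat_ch=32, w_dim=512):
        super().__init__()
        self.base = ModulatedMLP(fourier_ch, hidden, w_dim)  # shared coarse layers (w_base)
        self.shape = ModulatedMLP(hidden, hidden, w_dim)     # structure layers (w_k^s)
        self.texture = ModulatedMLP(hidden, hidden, w_dim)   # texture layers (w_k^t)
        self.to_depth = nn.Conv2d(hidden, 1, 1)              # pseudo-depth d_k
        self.to_feat = nn.Conv2d(hidden, feat_ch, 1)         # feature f_k

    def forward(self, fourier, w_base, w_s, w_t):
        h = self.base(fourier, w_base)
        h = self.shape(h, w_s)
        d_k = self.to_depth(h)       # depth depends only on the structure code
        h = self.texture(h, w_t)
        f_k = self.to_feat(h)
        return f_k, d_k
```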
Fusion Module
The depth values d_k and features f_k output by the local generators are fused in the Fusion module.
First, as in the above equation, the depth values d_k are converted into a semantic mask m.
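Written out, this mask computation is essentially a softmax across the pseudo-depth maps of all regions; the notation below is a reconstruction from the description and may differ slightly from the paper's equation.

\[ m_k(x, y) = \frac{\exp\bigl(d_k(x, y)\bigr)}{\sum_{i=1}^{K} \exp\bigl(d_i(x, y)\bigr)} \]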
Next, using this semantic mask m, we generate a feature map f using the following formula.
The feature map f is obtained by simply multiplying the semantic mask m element-wise with the features f_k output by the local generators and summing over the regions. This yields a (number of classes) x 256 x 256 semantic mask m and a feature map f, which are used to train the rendering network R described next.
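A minimal sketch of this fusion step in PyTorch follows, assuming the K local generators' outputs are stacked into a depth tensor of shape (B, K, H, W) and a feature tensor of shape (B, K, C, H, W); the shapes and class count are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse(depths: torch.Tensor, feats: torch.Tensor):
    """depths: (B, K, H, W), feats: (B, K, C, H, W) -> masks (B, K, H, W), fused feature (B, C, H, W)."""
    masks = F.softmax(depths, dim=1)                 # m_k: per-pixel competition between regions
    fused = (masks.unsqueeze(2) * feats).sum(dim=1)  # f = sum_k m_k * f_k (element-wise product)
    return masks, fused

# Example with 13 semantic classes and a 64x64 coarse resolution
depths = torch.randn(1, 13, 64, 64)
feats = torch.randn(1, 13, 32, 64, 64)
masks, fused = fuse(depths, feats)
print(masks.shape, fused.shape)  # torch.Size([1, 13, 64, 64]) torch.Size([1, 32, 64, 64])
```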
Rendering Network R
The rendering network R uses a slightly modified version of the StyleGAN2 generator.
The image above shows the overall view of the rendering network R.
The style modulation of StyleGAN2 is removed, and the only input is the feature map. The input feature map is resized to 16x16 so that the network can capture long-range relationships between classes.
Additional output branches are added so that a 256x256 RGB image and a semantic mask are obtained as output. Each branch outputs a residual relative to the previous output, and the final full-resolution output is obtained by repeatedly upsampling and merging the lower-resolution outputs.
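To illustrate the residual-output idea, here is a minimal sketch of a rendering network with per-resolution RGB and segmentation branches in the spirit of StyleGAN2's skip connections; the block structure, channel counts, and class count are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RenderNet(nn.Module):
    def __init__(self, in_ch=32, n_classes=13, channels=(64, 64, 32)):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.to_rgb = nn.ModuleList()   # residual RGB branch at each resolution
        self.to_seg = nn.ModuleList()   # residual mask branch at each resolution
        prev = in_ch
        for ch in channels:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(prev, ch, 3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.2)))
            self.to_rgb.append(nn.Conv2d(ch, 3, 1))
            self.to_seg.append(nn.Conv2d(ch, n_classes, 1))
            prev = ch

    def forward(self, feat):
        x, rgb, seg = feat, None, None
        up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False)
        for block, to_rgb, to_seg in zip(self.blocks, self.to_rgb, self.to_seg):
            x = up(block(x))                 # process and move to the next resolution
            r, s = to_rgb(x), to_seg(x)      # residuals at the current resolution
            rgb = r if rgb is None else up(rgb) + r
            seg = s if seg is None else up(seg) + s
        return rgb, seg
```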
learning
To learn the joint distribution of the RGB image and the semantic mask, we take them both as input to the discriminator.
However, we found that simply concatenating the two as input does not work, because the gradients coming from the semantic mask become too large.
Therefore, we trained by using a discriminator with the structure shown in the following image.
With this configuration, the gradient on the segmentation branch of the discriminator can be penalized with R1 regularization, which makes training possible. We also add the following regularization to the loss function so that the final mask obtained by upsampling does not deviate too much from the coarse mask.
where ∆m is the output of the semantic mask side branch in the rendering network R.
The final loss function is as follows
It is the loss function of StyleGAN2, to which we add the regularization on the semantic mask described above and the R1 regularization in the discriminator.
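For reference, R1 regularization itself is the standard gradient penalty on real samples; a minimal sketch for a discriminator taking both an image and a mask is shown below. The separate weight on the mask gradient is an assumption for illustration, not the paper's exact formulation.

```python
import torch

def r1_penalty(discriminator, real_img, real_mask, mask_weight=0.1):
    """Squared gradient norm of D's output w.r.t. the real image and mask inputs."""
    real_img = real_img.detach().requires_grad_(True)
    real_mask = real_mask.detach().requires_grad_(True)
    score = discriminator(real_img, real_mask).sum()
    grad_img, grad_mask = torch.autograd.grad(
        score, [real_img, real_mask], create_graph=True)
    penalty = grad_img.pow(2).flatten(1).sum(1).mean()
    penalty = penalty + mask_weight * grad_mask.pow(2).flatten(1).sum(1).mean()
    return penalty
```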
experimental results
Is the latent space properly separated?
First, to check whether images are generated with each semantic region separated, the following image shows the results of generating an image while adding the regions' components one by one. The pseudo-depth map shown is that of the newly added region.
We can see that each region is generated independently. It also shows that the pseudo-depth maps learn meaningful shapes even though no 3D supervision is given.
editing capability
SemanticStyleGAN uses semantic masks to separate the latent space into regions and thereby facilitate image editing. We now check how much the controllability improves over StyleGAN2 (trained on FFHQ) on various editing tasks. First, to edit a real image, we need to embed it in the GAN latent space. Here, we employ the ReStyle (pSp) method to obtain the latent variables corresponding to the image.
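Conceptually, ReStyle embeds an image by iteratively refining a latent code with an encoder that looks at both the target image and the current reconstruction. A minimal sketch of this idea is below; `encoder` and `generator` are hypothetical callables, and the number of iterations is illustrative.

```python
import torch

def invert(encoder, generator, image, avg_latent, n_iters=5):
    """Iteratively refine the latent code, starting from the average latent."""
    w = avg_latent.clone()
    recon = generator(w)
    for _ in range(n_iters):
        delta = encoder(torch.cat([image, recon], dim=1))  # predict a residual update
        w = w + delta
        recon = generator(w)
    return w
```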
The table below quantitatively compares the results of reconstructing images with the latent variables obtained by ReStyle (pSp).
For reference, StyleGAN2 in the bottom row was trained on the same dataset (CelebA-HQ) as SemanticStyleGAN. Looking at these quantitative results, there is not much difference between StyleGAN2 and SemanticStyleGAN, so we can see that the reconstruction performance is comparable.
To confirm whether the image editing capability is improved, we compared the results using InterFaceGAN and StyleFlow, which are typical editing methods.
The image above is the result of the comparison.
For four attributes that correspond to local regions (smile, baldness, bangs, and beard), editing models were built with StyleFlow and InterFaceGAN, and the difference maps between the edited images and the original images are shown.
In the results using StyleGAN2, parts unrelated to the edit are also modified because of the entanglement of the latent space. In contrast, with SemanticStyleGAN the latent space is separated by region, so unrelated parts are left untouched and only the intended parts are manipulated.
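To make region-wise editing concrete, the sketch below shows how local editing can be performed once the latent code is partitioned per region: only the structure and texture codes of the target region are replaced, while everything else is kept fixed. The latent layout and the `generator` call are assumptions for illustration, not the paper's actual interface.

```python
import copy

def edit_region(latents, reference_latents, region="hair"):
    """Swap only the target region's codes; all other codes stay fixed."""
    edited = copy.deepcopy(latents)
    edited["regions"][region]["structure"] = reference_latents["regions"][region]["structure"]
    edited["regions"][region]["texture"] = reference_latents["regions"][region]["texture"]
    return edited

# edited_image = generator(edit_region(source_latents, reference_latents, "hair"))
```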
summary
In this article, we introduced SemanticStyleGAN, which was accepted to CVPR 2022.
Our method improves the performance of local editing by using semantic masks to separate the latent space into different semantic regions.
The paper shows that the method works well on datasets of full-body images as well as on face-image datasets. However, because the technique builds a local generator for each class, it does not scale to datasets with too many classes. Also, and this is not limited to this research, improving the performance and controllability of GANs can also make it easier for people to misuse these techniques. It is already difficult for a human to tell whether an image synthesized with current GAN techniques is real or synthetic. As mentioned in the paper, I felt that techniques for distinguishing whether an image is synthetic are also very important.