GAN Inversion With Transformer!
3 main points
✔️ Transformer-based GAN Inversion method
✔️ Outperforms existing methods in reconstruction quality, editability, and model size
✔️ Editing with reference images is also available
Style Transformer for Image Inversion and Editing
written by Xueqi Hu, Qiusheng Huang, Zhengyi Shi, Siyuan Li, Changxin Gao, Li Sun, Qingli Li
[Submitted on 4 Dec 2021 (v1), last revised 29 Mar 2022 (this version, v3)]
Comments: Accepted by CVPR 2022
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
Recently, StyleGAN has become able to generate high-resolution images, and there has been a lot of research on applying it to various editing tasks for real images. To edit a real image, its StyleGAN latent variables must first be found by a method called GAN Inversion. A GAN Inversion method needs two abilities:
- Ability to faithfully reconstruct the original image (reconstruction capability)
- Ability to manipulate only the attributes you want to edit while preserving the original identity and details (editability)
Satisfying these at the same time is a difficult problem.
StyleGAN offers several candidate latent spaces for the embedding, including Z-space, W-space, and W+ space, and existing studies report that the choice among them matters. Z-space and W-space each represent an image with a single 512-dimensional vector, while W+ space uses 18 vectors of 512 dimensions each. This makes W+ space superior at representing image detail and at reconstruction, but it also makes independent attribute editing difficult, because a single attribute is often entangled across many dimensions.
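As a rough illustration of the size difference between these spaces, the following sketch builds placeholder codes with the shapes given in the text. The values are arbitrary placeholders rather than real StyleGAN latents, and the helper names are hypothetical.

```python
LATENT_DIM = 512  # size of one latent vector, as stated in the text
NUM_STYLES = 18   # number of vectors that make up a W+ code

def make_w_code(value=0.0):
    """Placeholder W-space code: a single 512-dimensional vector."""
    return [value] * LATENT_DIM

def to_w_plus(w):
    """Embed a W code in W+ by repeating it once per style input."""
    return [list(w) for _ in range(NUM_STYLES)]

w = make_w_code()
w_plus = to_w_plus(w)

# In W+ each of the 18 vectors can be changed independently of the others,
# which gives more freedom for reconstruction than one shared vector.
w_plus[10][0] = 1.0

print(len(w))                        # 512 numbers describe a W code
print(len(w_plus) * len(w_plus[0]))  # 9216 numbers describe a W+ code
```

The extra freedom is what helps reconstruction, and it is also why an edit in W+ can touch many more dimensions than intended.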
To improve reconstruction and editing abilities simultaneously, the paper proposes "StyleTransformer", a Transformer-based GAN Inversion method. Transformers have produced good results in various domains, including natural language processing.
The image above is the result of using StyleTransformer to output a reconstructed image and an edited image. You can see that the reconstruction quality is high and the editing is well done.
This technique also makes it possible to prepare a reference image and transfer specific attributes of the reference image to the target image.
The following diagram shows an overview of the Style Transformer framework.
First, encoder E extracts image features F1~F3 at multiple resolutions from the input image. N different queries output by the MLP then access these features through the Transformer Blocks and are gradually updated into the latent variables w fed to the generator.
All parameters of the encoder E, MLP, Transformer Block, and the initial value zn are trained so that the optimal latent variable w can be output.
The image above shows the structure of the Transformer Block.
The structure is similar to the conventional Transformer, and the design includes Multi-Head Self-Attention and Cross-Attention. The residual connection, normalization, and FFN module are also based on the structure of the conventional Transformer.
A typical Transformer decoder often initializes the input query tokens randomly and keeps them as learnable parameters. However, the distribution in W-space is complex and quite different from a Gaussian distribution, so this standard approach does not train well.
Therefore, the pre-trained mapping MLP of StyleGAN is used to map the latent variables zn to wn, which keeps the queries from deviating far from W-space. The pre-trained MLP is not frozen during training but fine-tuned.
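The query initialization described above can be sketched as follows. This is a toy version under stated assumptions: the weights are random stand-ins for the pre-trained MLP, the vectors are 8-dimensional instead of 512-dimensional, N is assumed to be 18 (one query per W+ vector), and all function names are hypothetical.

```python
import random

DIM = 8           # toy size for readability; the text uses 512
NUM_QUERIES = 18  # assumed N: one query per W+ vector

def make_layer(rng, n_in, n_out):
    """Random weight matrix standing in for a pre-trained layer."""
    scale = 1.0 / n_in ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def mapping_mlp(z, layers):
    """Map one latent z_n to a query w_n through the shared MLP."""
    h = z
    for weight in layers:
        h = [leaky_relu(sum(a * b for a, b in zip(row, h))) for row in weight]
    return h

rng = random.Random(0)
# Stand-in for StyleGAN's pre-trained mapping MLP (fine-tuned during training).
mlp_layers = [make_layer(rng, DIM, DIM) for _ in range(2)]
# Learnable initial values z_n, one per query.
z = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
# Initial queries w_n that are handed to the first Transformer Block.
queries = [mapping_mlp(z_n, mlp_layers) for z_n in z]

print(len(queries), len(queries[0]))  # 18 8
```

Because every query passes through the same MLP, the starting points already lie in the region the generator was trained on, rather than at arbitrary random positions.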
The calculations are similar to those used in the traditional Transformer.
Self-Attention learns the relationships between arbitrary pairs of input queries, which lets the model capture dependencies among the latent variables wn.
Self-Attention alone only looks at the relationships between the latent variables and does not involve any of the image features.
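A minimal sketch of this step, assuming a single head and no learned projection matrices (both simplifications made here for brevity, not details from the paper), shows that only the queries enter the computation:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries):
    """Each query attends to all queries; returns (outputs, weights)."""
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every query.
        scores = [dot(q, k) / math.sqrt(dim) for k in queries]
        attn = softmax(scores)
        # Weighted average of all queries, one output per input query.
        mixed = [sum(a * v[i] for a, v in zip(attn, queries)) for i in range(dim)]
        weights.append(attn)
        outputs.append(mixed)
    return outputs, weights

# Three toy 4-dimensional queries standing in for the latent variables w_n.
toy_queries = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
]
outputs, weights = self_attention(toy_queries)
print(len(weights), len(weights[0]))  # 3 3: one weight per pair of queries
```

No image feature appears anywhere in `self_attention`, which is exactly the limitation noted above.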
Therefore, Multi-Head Cross-Attention is used to obtain information from the image features F1~F3 at different resolutions. Specifically, the keys and values are computed from the image features, and the queries are computed from the output of Self-Attention.
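The following sketch illustrates the idea with a single head and without learned projections. It assumes that the queries visit one feature level per step and that each step adds its result through a residual connection; the toy feature maps and helper names are hypothetical.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def flatten(feature_map):
    """Turn an H x W grid of feature vectors into a list of H*W vectors."""
    return [vec for row in feature_map for vec in row]

def cross_attention(queries, features):
    """Queries read from image features, which act as keys and values."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [dot(q, f) / math.sqrt(dim) for f in features]
        attn = softmax(scores)
        read = [sum(a * f[i] for a, f in zip(attn, features)) for i in range(dim)]
        # Residual connection: the query is refined rather than replaced.
        updated.append([qi + ri for qi, ri in zip(q, read)])
    return updated

def toy_map(size, dim, value):
    """Constant size x size feature map used only for illustration."""
    return [[[value] * dim for _ in range(size)] for _ in range(size)]

# Toy feature maps at three resolutions, standing in for F1~F3.
feature_maps = [toy_map(1, 4, 0.1), toy_map(2, 4, 0.2), toy_map(4, 4, 0.3)]
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

# The queries visit the feature levels in turn and are updated each time.
for fmap in feature_maps:
    queries = cross_attention(queries, flatten(fmap))

print(len(queries), len(queries[0]))  # 2 4
```

The number of queries stays fixed no matter how many spatial positions a feature map has, so feature maps of any resolution can be read by the same set of queries.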
During training, the StyleGAN generator G is fixed and all other parameters are adjusted.
The loss function is the same as that of pSp, an existing GAN Inversion method. See the pSp paper for details.
Image editing with Style Transformer
As mentioned at the beginning, it is important for GAN Inversion not only to have good reconstruction performance but also good editing capability.
Style Transformer supports not only label-based attribute editing but also editing of specific regions using reference images.
Editing with reference images
One new Transformer Block is trained for editing with a reference image.
First, we train an attribute classifier C that takes W+ latent variables as input and outputs embedded features and labels for each attribute.
Next, a new Transformer Block is trained as shown in the image above. The reference and target images are each embedded into the latent space W+ by StyleTransformer. The latent variables of the reference image are then fed to the Transformer Block as the key and value, the latent variables of the target image are fed as the query, and the block outputs a new latent variable w_e.
The loss function is computed so that the attributes to be edited match those of the reference image while the remaining attributes match those of the target image, which yields a latent variable w_e carrying the desired edit.
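The article does not give the exact form of this loss, so the following is only a sketch of the idea. It assumes that a classifier such as C yields one embedding per attribute and that a squared distance is used for alignment; the attribute names, embeddings, and function names are all hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def reference_edit_loss(edited, reference, target, edit_attrs):
    """Sum of per-attribute distances between embedding dictionaries.

    Attributes in edit_attrs are pulled toward the reference image,
    all remaining attributes are pulled toward the target image.
    """
    loss = 0.0
    for name, emb in edited.items():
        goal = reference[name] if name in edit_attrs else target[name]
        loss += squared_distance(emb, goal)
    return loss

# Hypothetical per-attribute embeddings, as a classifier C might output them.
target = {"hair": [0.0, 0.0], "smile": [1.0, 0.0]}
reference = {"hair": [1.0, 1.0], "smile": [0.0, 1.0]}

# An edit that copies only the hair of the reference has zero loss...
good_edit = {"hair": [1.0, 1.0], "smile": [1.0, 0.0]}
# ...while leaving the target unchanged is penalised on the hair attribute.
no_edit = {"hair": [0.0, 0.0], "smile": [1.0, 0.0]}

print(reference_edit_loss(good_edit, reference, target, {"hair"}))  # 0.0
print(reference_edit_loss(no_edit, reference, target, {"hair"}))    # 2.0
```

Minimising a loss of this shape rewards changing the selected attribute and penalises any drift in the others, which is the behaviour described in the text.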
First, the reconstruction and attribute editing results are compared with those of existing methods in the following images.
The second row shows the reconstruction results. pSp is a GAN Inversion method known for high-quality reconstruction, and the results of the proposed method look just as good. e4e is a GAN Inversion method known for high editing capability, but its reconstruction quality does not seem to be as high.
Looking at the editing results, the proposed method seems to separate the individual attributes better than e4e, even though e4e is regarded as strong at editing.
The authors experimented not only with face images but also with car images, and both the reconstruction quality and the quality of the edited results are high.
The following table shows the results of a quantitative comparison with existing methods.
Reconstruction is evaluated by pixel-level and perceptual similarity to the input (MSE, LPIPS) and by the distance between the distributions of real and generated images (FID, SWD), and the proposed method achieves the best result on every metric. The editing results are also evaluated by FID and SWD, and the proposed method again achieves the best result.
Beyond image quality, the proposed method also has a smaller model size and a shorter inference time than pSp and e4e, so it outperforms both on all metrics listed here.
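Of these metrics, MSE is the simplest to compute by hand. The sketch below uses tiny grayscale images as nested lists; the function name and the sample values are made up for illustration, and a lower value means a closer reconstruction.

```python
def mse(image_a, image_b):
    """Mean squared error between two equally sized grayscale images."""
    total, count = 0.0, 0
    for row_a, row_b in zip(image_a, image_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count

original = [[0.0, 1.0], [1.0, 0.0]]
reconstruction = [[0.0, 0.0], [1.0, 1.0]]

print(mse(original, reconstruction))  # 0.5
print(mse(original, original))        # 0.0
```

LPIPS, FID, and SWD rely on learned features or on statistics over many images, so they cannot be reduced to a few lines in the same way.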
Editing with reference images
The image above is the result of editing with a reference image. Although the diversity of the results is somewhat limited, only specific attributes of the reference image are reflected in the original image.
In this article, we introduced Style Transformer, which was accepted to CVPR 2022.
The method outperforms previous methods in various aspects, including high performance in both reconstruction and editing capabilities, which are important in GAN Inversion, as well as the small model size and short inference time.
As the GAN Inversion technology advances, it will be possible to perform image editing at high speed and high quality, so future GAN Inversion methods will also be of interest.