Infinite Resolution! Explaining The Latest Super-resolution, Synthesis And Enhancement Model, InfinityGAN!

GAN (Generative Adversarial Network) 24/08/2021

3 main points
✔️ Increase image resolution without limit at low cost by patching
✔️ Generate and synthesize seamless images taking into account both global and local factors
✔️ Proposes interesting network structures for super-resolution, image synthesis, and image enhancement

InfinityGAN: Towards Infinite-Resolution Image Synthesis
written by Chieh Hubert Lin, Hsin-Ying Lee, Yen-Chi Cheng, Sergey Tulyakov, Ming-Hsuan Yang
(Submitted on 8 Apr 2021)
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)


The images used in this article are from the paper or created based on it.

Project page: this https URL

introduction

In this article, I would like to introduce InfinityGAN (InfinityGAN: Towards Infinite-Resolution Image Synthesis), which was recently released on arXiv.

Although InfinityGAN has not achieved SOTA on its own or become a hot topic, its proposed network structure is interesting because it has potential applications in various fields such as super-resolution, image synthesis, and image enhancement. First, let's take a look at an image generated using InfinityGAN.

This image is taken from the paper, and I think it looks strange at first glance. For comparison, the following is the kind of image you usually see in super-resolution GAN papers.

Compared to this, the image generated by InfinityGAN has a higher resolution, but it also covers a very wide field of view, which makes it look somewhat unrealistic.

This is because InfinityGAN performs super-resolution, image synthesis, and image enhancement simultaneously. It not only increases the resolution but also combines several image patches and performs texture synthesis at the same time. The previous image is 1024 x 2048 pixels, but it is actually composed of 242 image patches that have been stitched together.

The problem with existing super-resolution models is that they use very high-resolution images as training data. This makes it impossible to generate images with a higher resolution than the training data, and above all, the computing cost is too high.

In addition, existing texture synthesis models such as SinGAN and InGAN can generate high-resolution images in various sizes, but they do not learn the structure of the image itself, and the generated images look like a repetition of the same texture.

InfinityGAN solves the above problems by generating images patch by patch while considering global, local, and texture factors. Training on small image patches keeps the computing cost low, and the resulting model combines aspects of super-resolution, image synthesis, and image enhancement. Let's take a closer look at the network structure that produces such interesting results.

proposed method

As is often mentioned in the context of attention and elsewhere, images need to be considered both globally and locally. Globally, an image should be coherent and contextual, and its overall representation should be relatively compact (not complex). When we see an image of a medieval landscape, we should be able to say, "Oh, it looks medieval." That "medieval look" needs to be preserved across the whole image.

Locally, as with convolution, a close-up image is defined by the structure and texture of its local neighborhood. The structure represents the objects, shapes, and their arrangement in the local region. Once the structure is defined, the second step is to consider the texture, conditioned on that structure. Structure and texture, while local, must also be consistent with the global picture.

Taking all of this into account, it is possible to generate an image with infinite resolution. First, once the global picture is determined, the local structures and textures can be extended spatially infinitely, as long as they follow the context of this picture.

outline

Based on the above analysis, InfinityGAN is composed of two parts: a structure synthesizer Gs for modeling the whole image and a texture synthesizer Gt for modeling the local texture. In addition, low-resolution image patches are used for training. This framework is shown in the figure below.

Four latent variables control the generation process. The global latent variable Zg is given to both Gs and Gt so that each image patch can take the whole image into account. Gs renders the structure of each patch at the location specified by the coordinate grid c, and local variations between patches are modeled by the local latent variable Zl. Once the structure is defined, multiple textures are still possible, so each layer of Gt receives an additional noise input Zn to model local fine detail that is not captured by Zg. Let pc be the patch generated at position c. The generation process can be described as follows, where Zs represents the latent variables of the structure.
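Written with the symbols defined above, the two-stage generation can be expressed as the block below. This is a reconstruction from the description in this section, so the exact argument order should be treated as an assumption.

```latex
\[
Z_s = G_s(Z_g, Z_l, c), \qquad p_c = G_t(Z_s, Z_g, Z_n)
\]
```

In words: Gs turns the global variable, the local variable, and the coordinate into a structure representation Zs, and Gt turns Zs into the final patch pc using the same global variable plus the noise Zn.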

Structure Synthesizer

The Structure Synthesizer is an implicit function implemented as a neural network. Its purpose is to sample an implicit representation conditioned on the global latent variable Zg and the local latent variable Zl, and to generate the structure at the queried position c.

The global latent variable Zg serves as a global holistic representation. Zg is sampled once from the unit Gaussian distribution and injected into every layer and every pixel of Gs by feature modulation.

The local latent variable is represented by Zl. Local variation is independent across spatial positions, so each spatial position of Zl is sampled independently from the unit Gaussian prior. Because every position is sampled independently, Zl forms a volume tensor that can be extended infinitely in the spatial dimensions. This Zl is used as the input to Gs.
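The two sampling rules above can be sketched in a few lines. This is a minimal illustration using only Python's standard library; the function names `sample_latents` and `extend_right`, the channel count, and the grid sizes are assumptions made for the example, not values from the paper.

```python
import random

def sample_latents(height, width, channels=4, seed=0):
    """Sample one global vector Zg and an independent local vector Zl per position."""
    rng = random.Random(seed)
    # Zg: drawn once from a unit Gaussian and shared by every spatial position.
    z_g = [rng.gauss(0.0, 1.0) for _ in range(channels)]
    # Zl: drawn independently from a unit Gaussian at each (row, col) position.
    z_l = [[[rng.gauss(0.0, 1.0) for _ in range(channels)]
            for _ in range(width)]
           for _ in range(height)]
    return z_g, z_l

def extend_right(z_l, extra_cols, channels=4, seed=1):
    """Grow Zl horizontally; the positions that already exist are left unchanged."""
    rng = random.Random(seed)
    return [row + [[rng.gauss(0.0, 1.0) for _ in range(channels)]
                   for _ in range(extra_cols)]
            for row in z_l]

z_g, z_l = sample_latents(height=3, width=5)
wider = extend_right(z_l, extra_cols=4)  # 3 x 9 grid, first 5 columns identical
```

Because no position depends on any other, the grid can be grown in any direction without resampling what is already there, which is what makes Zl spatially extensible.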

Finally, given the implicit representation defined by Zg and a Zl of arbitrary size, the coordinate c acts as a query that selects the region to be retrieved from the implicit image. Letting T be the period of the sinusoidal coordinate, c is encoded as follows.
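As far as I can tell from the paper, the encoding is c = (tanh(i_y), cos(i_x / T), sin(i_x / T)), where (i_x, i_y) is the position being queried. The sketch below assumes that form; the function name `encode_coordinate` and the sample values are hypothetical.

```python
import math

def encode_coordinate(i_x, i_y, period):
    """Encode a position as (tanh(i_y), cos(i_x / T), sin(i_x / T)) (assumed form)."""
    # Vertical axis: tanh saturates, so positions far above or far below
    # the center map to nearly constant values.
    vertical = math.tanh(i_y)
    # Horizontal axis: a cos/sin pair, so the encoding repeats horizontally.
    # With this form, the repeat distance in i_x is 2 * pi * T.
    return (vertical, math.cos(i_x / period), math.sin(i_x / period))

origin = encode_coordinate(i_x=0.0, i_y=0.0, period=4.0)
one_cycle_later = encode_coordinate(i_x=2.0 * math.pi * 4.0, i_y=0.0, period=4.0)
```

The point of the design is that the horizontal direction has no beginning or end, while the vertical direction still distinguishes, for example, sky from ground.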

In addition, a mode-seeking diversity loss is applied between two local latent variables Zl1 and Zl2 to prevent the model from ignoring Zl and generating repetitive structures.
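A mode-seeking loss is usually written as a ratio: how far apart two outputs are, divided by how far apart their latent inputs are. The sketch below is a generic version of that idea, not InfinityGAN's exact loss. The L1 distance, the reciprocal form, and the names `mean_abs_diff` and `mode_seeking_loss` are assumptions, and the two latent inputs must differ.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mode_seeking_loss(out1, out2, z1, z2, eps=1e-5):
    """Small when different latents give different outputs, large when they collapse."""
    ratio = mean_abs_diff(out1, out2) / mean_abs_diff(z1, z2)
    # The generator minimizes the reciprocal, which maximizes the ratio.
    return 1.0 / (ratio + eps)

z1, z2 = [0.0, 1.0], [1.0, 0.0]
collapsed = mode_seeking_loss([0.5, 0.5], [0.5, 0.5], z1, z2)  # Zl ignored
diverse = mode_seeking_loss([0.0, 1.0], [1.0, 0.0], z1, z2)    # Zl used
```

A generator that ignores Zl produces identical structures for Zl1 and Zl2, so the ratio drops to zero and the loss becomes very large.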

We also use the technique of feature unfolding to allow Gs to consider information from a wider neighborhood of Zl and c. Given an intermediate feature f in Gs, the feature map u obtained using k × k feature unfolding is

$u_{(i,j)} = \mathrm{Concat}(\{f(i+n, j+m)\}_{n,m \in \{-k/2, \dots, k/2\}})$

where "Concat(·)" concatenates the unfolded vectors in the channel dimension. With feature unfolding, c becomes a grid of coordinates instead of a simple triplet.
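Feature unfolding can be illustrated on a tiny feature map stored as a nested list of shape H x W x C. This is a minimal sketch: it assumes an odd k and only emits positions whose full k × k neighborhood exists (no padding), and the name `unfold_features` is hypothetical.

```python
def unfold_features(f, k=3):
    """k x k feature unfolding on an H x W x C nested list, valid positions only."""
    r = k // 2  # neighborhood radius; assumes k is odd
    height, width = len(f), len(f[0])
    out = []
    for i in range(r, height - r):
        row = []
        for j in range(r, width - r):
            u = []
            for n in range(-r, r + 1):
                for m in range(-r, r + 1):
                    # Concatenate the neighbor's channels onto the output vector.
                    u.extend(f[i + n][j + m])
            row.append(u)
        out.append(row)
    return out

# A 4 x 4 map with one channel whose value encodes its own position.
f = [[[float(10 * i + j)] for j in range(4)] for i in range(4)]
u = unfold_features(f, k=3)  # 2 x 2 map with 9 channels
```

Each output position now carries the features of its whole 3 × 3 neighborhood, at the cost of a k² increase in channel count.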

Texture Synthesizer

We are almost there. Next, let's look at the Texture Synthesizer.

The Texture Synthesizer is based on the well-known StyleGAN2 generator. First, the fixed constant input is replaced with Zs, and random noise Zn is injected to model fine random textures. Next, Zg is fed to the mapping layer, a multilayer perceptron that projects the single Zg into layer-wise styles Zt. The style Zt is then injected into all pixels of each layer by feature modulation. Finally, all zero padding is removed from the generator, as shown in the figure below.
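In StyleGAN2, feature modulation is applied to the convolution weights: the style scales each input channel, and the result is renormalized per output channel. The sketch below shows this for 1 × 1 convolution weights in plain Python. Whether InfinityGAN uses exactly this form in every layer is an assumption, and `modulate_demodulate` is a hypothetical name.

```python
import math

def modulate_demodulate(weights, style, eps=1e-8):
    """weights[o][i]: 1x1 conv weights; style[i]: scale for input channel i."""
    out = []
    for row in weights:
        # Modulation: scale each input channel's weight by the style.
        scaled = [w * s for w, s in zip(row, style)]
        # Demodulation: renormalize so each output channel has unit L2 norm.
        norm = math.sqrt(sum(v * v for v in scaled) + eps)
        out.append([v / norm for v in scaled])
    return out

weights = [[1.0, 2.0], [3.0, 4.0]]
style = [2.0, 0.5]  # stands in for one layer's Zt
modulated = modulate_demodulate(weights, style)
```

Since the same style is applied at every pixel of a layer, a single Zg can steer the look of an arbitrarily large image.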

We remove all zero padding for three main reasons. First, the StyleGAN2 generator relies heavily on the positional information that a CNN obtains through zero padding, which lets it memorize structural information from the training images.

Second, positional encoding through zero padding becomes a serious problem when generalizing the model to synthesize images of arbitrary size from input latent variables of arbitrary size. In the third column of the above figure, we show that when the input latent variable of the StyleGAN2 generator is expanded multiple times, the central part of the feature map does not receive the expected coordinate information from the padding, resulting in an extensive repetitive texture in the central part of the output image.

Finally, the presence of padding prevents G from generating independent patches that can be composed into a single image. Therefore, we remove all padding so that no positional information remains in the generator.
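One practical consequence of removing padding is that every convolution shrinks the feature map, so the input has to be larger than the desired output. The arithmetic below is a simplified sketch: it assumes stride-1 convolutions with odd kernels and ignores the upsampling layers a real generator has. The function names are hypothetical.

```python
def valid_conv_output(size, kernel=3, layers=1):
    """Spatial size after `layers` stride-1 convolutions without padding."""
    for _ in range(layers):
        size -= kernel - 1  # each layer trims (kernel - 1) / 2 from every side
        if size <= 0:
            raise ValueError("input too small for this many unpadded layers")
    return size

def required_input(target, kernel=3, layers=1):
    """Input size needed to end up with `target` after the same layers."""
    return target + layers * (kernel - 1)

shrunk = valid_conv_output(16, kernel=3, layers=4)
needed = required_input(8, kernel=3, layers=4)
```

Here four unpadded 3 × 3 layers turn a 16-pixel-wide input into an 8-pixel-wide output, so a patch of a given size needs a correspondingly wider window of input features.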

These changes make it easy to synthesize images of arbitrary resolution. The model is forced to rely entirely on the structural features provided by Gs, while Gt serves only to model texture-related details.

Training

The discriminator D of InfinityGAN is the same as that of StyleGAN2. The entire network is trained using the loss functions of StyleGAN2: non-saturating logistic loss, R1 regularization, and path length regularization. In addition, G and D are trained with an auxiliary task that predicts the vertical position of the patches, which encourages the generated patches to follow the conditional distribution along the vertical direction of the image.
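For reference, the non-saturating logistic loss can be written per sample with a softplus. The sketch below covers only this adversarial term. R1, path length regularization, and the auxiliary vertical-position task are left out, and the function names are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def d_loss(real_logit, fake_logit):
    """Discriminator: push real logits up and fake logits down."""
    return softplus(-real_logit) + softplus(fake_logit)

def g_loss(fake_logit):
    """Generator (non-saturating): push the discriminator's fake logits up."""
    return softplus(-fake_logit)

undecided = g_loss(0.0)  # D outputs logit 0 for a fake: loss is log 2
fooled = g_loss(5.0)     # D thinks the fake is real: small loss
caught = g_loss(-5.0)    # D rejects the fake: large loss
```

The generator loss stays informative even when the discriminator confidently rejects a sample, which is why this form is preferred over the original saturating one.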

Therefore, the overall loss function of InfinityGAN is given by
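Putting together the loss terms named in this article, the objective can be sketched as below. The symbols are assumptions introduced for this sketch: L_adv is the non-saturating logistic loss, L_ar the auxiliary vertical-position task, L_R1 the R1 regularization, L_div the mode-seeking diversity loss, L_path the path length regularization, and each λ a weight. Check the paper for the exact form.

```latex
\[
\min_D \; L_{adv} + \lambda_{ar} L_{ar} + \lambda_{R1} L_{R1}, \qquad
\min_G \; -L_{adv} + \lambda_{ar} L_{ar} + \lambda_{div} L_{div} + \lambda_{path} L_{path}
\]
```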

experiment

The figure above shows a comparison with other methods in the Extended Resolution task. It is important to note that this is not just a super-resolution or outpainting task, but a unique task that synthesizes and enhances images while also improving image quality. Performance is measured by scaling the generated high-resolution image back down to the training resolution and comparing it with the original images.

From the above figure, it can be seen that as the image is expanded 4 times or 8 times, the performance of InfinityGAN becomes better than the other methods because it captures the global image features. In the figure below, we can also see that InfinityGAN succeeds in synthesizing the image without repeating the texture, while retaining the features of the whole image, unlike the other methods.

Furthermore, in the figure below, by changing the local latent variable Zl and the texture latent variable Zt, we can see that the Structure Synthesizer and Texture Synthesizer can model the structure and texture separately and separate their roles.

In addition, the image below is a composite of 258 independently generated patches in four different styles. This image, which is a fusion of multiple styles, also shows the fun of InfinityGAN.

In addition, the following figure shows the advantage of InfinityGAN on the outpainting task: by adding InfinityGAN to the In&Out outpainting model, we can clearly improve the results and achieve SOTA.

summary

What did you think? Rather than simply excelling at a particular task, InfinityGAN obtains interesting results by combining a patch-based approach with global and local context. It may lead to new insights for various fields such as super-resolution, style synthesis, and image enhancement in the future.

In addition, many supplementary materials are available in this paper. Also, there is a project page available. If you want to see more visual details, or if you want to know the network structure of the generator or the implementation part of the style synthesis, you may want to refer to the project page.