
We Finally Found Out! It Was Him!



3 main points
✔️ A study of style transfer reveals the cause of a long-standing puzzle
✔️ Residual connections are the culprit
✔️ Style transfer does not work well with low-entropy feature maps

Rethinking and Improving the Robustness of Image Style Transfer
written by Pei Wang, Yijun Li, Nuno Vasconcelos
(Submitted on 8 Apr 2021)
Comments: Accepted by CVPR2021 (Oral)

Subjects: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)



Style transfer is the task of rendering the style of one image onto a different content image. Since the advent of deep learning, this field has received a great deal of attention.

Let us first explain the background of this research. In style transfer research, pre-trained VGG networks are known to work well. It is then natural to assume that style transfer could be improved further by using models with better classification accuracy than VGG, such as ResNet, InceptionNet, and DenseNet. But here is the funny thing: instead of improving, quality dropped significantly. At the time, the cause of this problem was unknown, and research proceeded under various hypotheses.

Figure 1. Style transfer sample diagram

The prevailing theory at the time was that VGG features might be more robust than those of ResNet, InceptionNet, and DenseNet. Research then followed showing that if robustness was improved through adversarial training, even ResNet produced better style transfer. As it turned out, it worked. The takeaway was that robustness matters for style transfer.

However, this theory was soon cast back into mystery by a subsequent finding:

style transfer works well even for a VGG with random weights.

This suggests that the cause is not the robustness of the training or the data, but the structure of the model itself. Indeed, the present study confirms that the model architecture is responsible. Until now, however, it was not clear what aspect of the architecture determines whether style transfer succeeds or fails, and why.

Concept of Style Transfer

This section explains the conceptual background of style transfer. In the end, we only need to transform the style while keeping the outline of the content image, so the task can be posed as an optimization problem: find just the right image (by minimization) that reflects the style while also preserving the content. When Gatys et al. first proposed this in 2015, the optimization problem took the following form. Let $\vec{p}$ be the input content image, $\vec{a}$ the input style image, and $\vec{x}$ the image to be generated:

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x})$$

where $\alpha$ and $\beta$ weight the content and style terms, respectively.

Content reconstruction

CNNs are believed to capture the higher-order content of an image in their higher layers. For the content loss, a particular feature map is extracted from the content image $\vec{p}$ and the generated image $\vec{x}$, and their mean squared error is taken as the loss:

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2}\sum_{i,j}\left(F^l_{ij} - P^l_{ij}\right)^2$$

where $F^l_{ij}$ is the activation of filter $i$ at position $j$ of the feature map in layer $l$ of the generated image, and $P^l_{ij}$ is the corresponding activation for the content image.
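As a minimal NumPy sketch (not the authors' code), the content loss above amounts to half the sum of squared differences between two feature maps, here flattened to shape (channels, positions):

```python
import numpy as np

def content_loss(F, P):
    """Gatys-style content loss: 0.5 * sum of squared differences
    between the generated-image feature map F and the content-image
    feature map P, both of shape (channels, height*width)."""
    return 0.5 * np.sum((F - P) ** 2)

# Toy feature maps: 4 channels, 9 spatial positions.
rng = np.random.default_rng(0)
P = rng.standard_normal((4, 9))
print(content_loss(P, P))        # 0.0 — identical features give zero loss
print(content_loss(P + 1.0, P))  # ≈ 18.0 — 0.5 * 36 unit differences
```

In a real pipeline `F` and `P` would come from the same layer of a frozen backbone (e.g. VGG19), and the loss would be backpropagated into the pixels of $\vec{x}$.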

Style reconstruction

The style of an image is represented by the correlations between the individual filter responses in each layer. These correlations are given by the Gram matrix, the inner products between pairs of feature maps: $G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$.

The style loss is then the mean squared difference between the Gram matrix of the generated image's feature maps and the Gram matrix of the style image's feature maps.


Here we show the results for both the pre-trained models and the networks initialized with random weights. The prefixes "r-" and "p-" indicate whether the model was randomly initialized or pre-trained on ImageNet, respectively. For the VGG and ResNet models, we use VGG19 and ResNet-50.

Results of style transfer

Panels b to e show examples with the r-VGG, p-VGG, r-ResNet, and p-ResNet networks. Performance varies greatly with architecture: compared to p-VGG, p-ResNet fails to transfer many more color patterns. This is even more evident for the random "r-" models, where r-ResNet fails to transfer style almost entirely.

Next, to investigate why the performance of the two architectures differs so much, the authors ablate the following design choices:

  1. Use of residual connections
  2. Convolution kernel sizes of 1×1, 3×3, and 7×7
  3. Varying network depth
  4. Batch normalization
  5. Varying the number of channels per layer
  6. Stride-2 convolution versus max pooling

Ablation studies are performed across many network structures combining these choices. (For the full set of results, see the supplementary material of the paper.)

Using Residual Connection

Here we show the results for residual connections, the most important factor for this article's central question: "Why does ResNet lose quality in style transfer?"

The authors support this claim by comparing results from several architectures after removing or adding residual connections. They build "NoResNet" by removing all residual connections from ResNet. r-NoResNet appears to perform somewhat closer to r-VGG than r-ResNet does, but the difference is subtle. They then examined several further modifications to bring NoResNet closer to VGG:

  1. Replacing the 7×7 convolution with VGG's 3×3 convolution
  2. Replacing the bottleneck module with ResNet-34's basic block module, without residual connections
  3. Adding max pooling between stages, as in VGG, to reduce the feature map size

The resulting architecture is called "pseudo-VGG"; as shown in panel g, these modifications bring style transfer performance close to that of r-VGG. At this point the authors suspected that removing the residual connections was the deciding factor, but since several changes were made at once, this still had to be tested directly.

They therefore reintroduced residual connections into pseudo-VGG, creating "pseudo-ResVGG". Looking at result h, it is clear that the residual connections destroy the previous gains: pseudo-ResVGG in fact produces the worst results. In other words, ResNet performs poorly precisely because of its residual connections.

Why residual connections are undesirable

Why are residual connections undesirable for stylization? Style optimization depends only on the Gram matrices $G^l$ computed from the network's feature maps for the original and synthesized styles, so it is dominated by those feature maps. The authors therefore start by visualizing the network activations and the statistics of their Gram matrices. The figure shows, for 10 style images, the last-layer activations of each r-model network together with the maximum activation $\max_{ik} F^l_{ik}$, the maximum Gram entry $\max_{ij} G^l_{ij}$, and the average value and normalized entropy of $G^l_{ij}$. The activations and Gram values behave similarly. In both cases, for architectures with residual connections (ResNet and pseudo-ResVGG), the maximum value increases with layer depth while the entropy gradually decreases. This differs from the networks without shortcuts (NoResNet and pseudo-VGG), where activations tend to decrease and entropy stays nearly constant and much higher. In some cases, such as pseudo-ResVGG, introducing the residual connections makes the maximum value in the deep layers very large, with entropy close to zero. In other words, the activations become dominated by a single decisive correlation pattern in one feature channel.
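To make the "peaky Gram matrix" idea concrete, here is a small sketch of a normalized-entropy statistic of the kind plotted in the figure (the helper name and the exact normalization are my own illustration, not the paper's code):

```python
import numpy as np

def normalized_entropy(G):
    """Normalized entropy of the absolute Gram entries: treat |G_ij|
    as an unnormalized distribution; ~1.0 means uniform, near 0 means
    a single entry dominates ('peaky')."""
    p = np.abs(G).ravel()
    p = p / p.sum()
    # 0 * log(0) is taken as 0
    h = -np.sum(np.where(p > 0, p * np.log(p), 0.0))
    return h / np.log(p.size)

uniform = np.ones((4, 4))
peaky = np.zeros((4, 4))
peaky[0, 0] = 1.0
print(normalized_entropy(uniform))  # ≈ 1.0: no dominant correlation
print(normalized_entropy(peaky))    # 0.0: one entry dominates completely
```

A deep residual layer whose Gram matrix looks like `peaky` gives the optimizer essentially one number to match, which is the failure mode described above.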

The low entropy is also consistent with the low quality of the style transfer: the only variable in the style optimization is the activation $F^l(x)$, which is driven to have a Gram matrix as similar as possible to that of the style image, $F^l(x^s_0)$. If the Gram matrix obtained from $F^l(x^s_0)$ is "peaky" (low entropy), the optimization concentrates on matching that peak in the matrix obtained from $x^*$. In short, the remaining entries of the Gram matrix become almost meaningless.

Proposed method

The proposed method now follows directly: if a softmax-based smoothing is applied to the feature representation to increase its entropy, the cause identified above is addressed, and the quality of style transfer should improve as well.

The authors therefore propose Stylization With Activation smoothinG (SWAG). Please refer to the paper for the exact formulation; the proposal is simply to add smoothing to the activations.

SWAG results

In the graph above, $ResNet^∗$ denotes the result of applying SWAG: the maxima are suppressed and the entropy increases, especially in the deeper layers. The corresponding style transfer results are also shown.

Despite using ResNet, the transfer is now quite accurate. In both cases, the SWAG-adapted models significantly improve the quality of the stylized images, in the sense that more sophisticated style patterns are transferred: r-ResNet∗ approaches r-VGG, and p-ResNet∗ even seems to outperform p-VGG. This is evaluated quantitatively next. Note that there are several ways to smooth the activations and raise the entropy, such as softmax at different temperatures, nested softmax, or even multiplying by a small constant (<0.1), all of which proved effective in the experiments. The authors chose the plain softmax because it is simple and requires no hyperparameters; this also rules out the possibility that hyperparameter tuning, rather than the smoothing itself, caused the improvement.
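As a toy illustration of the temperature point (my own example, not from the paper): raising the softmax temperature makes the resulting distribution smoother, which directly raises its entropy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    """Shannon entropy of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

z = np.array([5.0, 1.0, 0.5, 0.2])
h_sharp = entropy(softmax(z / 0.5))   # low temperature: peaky output
h_smooth = entropy(softmax(z / 5.0))  # high temperature: smoother output
print(h_sharp < h_smooth)  # True
```

Plain softmax corresponds to fixing the temperature at 1, which is what lets the authors avoid introducing a tunable hyperparameter.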

The following are the results of the user survey.

In the user evaluation, the SWAG-adapted models are rated overwhelmingly higher, even compared to VGG, which had been dominant until now.


The reason for VGG's dominance in style transfer, long shrouded in mystery, has finally been revealed. One of the reasons I learned about this problem in the first place was a graph in the article "Neural Style Transfer with Adversarially Robust Classifiers", shown in the figure below. The graph indicates that ResNet, InceptionNet, and DenseNet readily pick up non-robust features. In other words, the reason ResNet does not perform well in style transfer (its results feel unnatural to human vision) is tied to its reliance on non-robust features.

Non-robust features originally came from research on adversarial examples, but they became a hot topic in the style transfer field around 2019 because the findings were consistent with style transfer research.

Looking at the results of this study, I was struck by the strength of researchers who keep their antennae up for other fields and study a wide range of papers in detail.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.
