There Was An Implicit Error?
3 main points
✔️ A major error existed in the calculation of FID
✔️ FID was found to be affected by image format
✔️ The authors recommend resizing with PIL's bicubic interpolation
On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation
written by Gaurav Parmar, Richard Zhang, Jun-Yan Zhu
(Submitted on 22 Apr 2021)
Comments: Accepted by arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
First of all
This paper carefully examines how much FID, the metric commonly used to evaluate the accuracy of GANs, varies with implementation details. Even if you are not doing research using GANs, you should definitely take a look at this paper, because its findings may affect a relatively large number of people. Of course, if you are doing research using GANs, this is a must-read!
First, we briefly explain how the accuracy of a GAN is evaluated.
The first thing to consider is what it means for a GAN to have high accuracy. Does it mean that it generates beautiful images? Or that it generates face images that people cannot tell apart from real ones? The problem is that there is no obvious ground truth to compare against. If there were, we could measure accuracy directly, but how do we measure the "realness" of a (fake) face image generated by a GAN? This is why the following metric was developed:
IS (Inception Score, 2016): This is an evaluation metric that uses an Inception Network trained on ImageNet. Simply put, it measures whether the images generated by a GAN are easily recognized as real objects by an Inception Network trained on real images.
However, this method has a big flaw: it does not take the distribution of real images into account. So the well-known FID (Fréchet Inception Distance) was proposed to compare against the distribution of real images.
The idea of FID is to compute the Fréchet distance between the distribution of real images and the distribution of generated images, which addresses the disadvantage of IS mentioned earlier, namely that the distribution of real images is ignored. As shown in the figure, the Fréchet distance is measured between the Inception features of the real images and those of the generated images.
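To make the distance concrete, here is a minimal sketch of the Fréchet distance between two sets of feature vectors, assuming (as FID does) that each set is summarized by a Gaussian with its mean and covariance. It is not the authors' implementation: the function name and the NumPy-only eigenvalue trick for the matrix square root are choices made for this illustration, and extracting features with the Inception Network is left out.

```python
import numpy as np

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two (N, D) feature arrays.

    d^2 = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2))
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_fake, rowvar=False)

    # Tr((C_r C_g)^(1/2)) is the sum of the square roots of the eigenvalues
    # of C_r @ C_g, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

With identical inputs the distance is zero up to rounding error, and shifting every feature by a constant adds the squared length of that shift.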
The FID mentioned in the review above actually has some problems.
- There is an implicit image resize in the FID calculation. (The "Resize to ~" step in the above figure is the relevant part.)
Images must be resized to 299 x 299 before they can be fed into the Inception Network. This resizing depends on the framework and is always performed at calculation time. Basically, it is unavoidable.
- There is no fixed FID implementation, so it depends on the researcher.
Therefore, even for the same FID metric, the values vary depending on which script is used.
These problems are well known. But the conclusion of this paper is that many libraries contain errors that affect the FID calculation in the first place. I was very surprised by this.
Library error checking
This figure compares how each library actually downsamples an image (left). A 128x128 image of a circle is downsampled to 16x16. The downsampling clearly fails for every library except PIL: the other results show aliasing artifacts. The result on the right also clearly shows the difference in the noise image after processing.
Such a library-dependent error in resizing may well affect the FID calculation. The paper examines this issue.
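The kind of failure seen in the circle experiment can be reproduced with a toy example. The sketch below is a simplified illustration, not the code of any of the libraries compared in the paper: it downsamples a 128x128 ring to 16x16 once by picking every 8th pixel (no low-pass filtering) and once by averaging each 8x8 block. The function names and the ring parameters are hypothetical.

```python
import numpy as np

def make_ring(size=128, radius=48.0, half_width=1.0):
    """Binary image of a thin ring centered in a size x size canvas."""
    center = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    dist = np.hypot(y - center, x - center)
    return (np.abs(dist - radius) < half_width).astype(np.float64)

def downsample_nearest(img, factor):
    """Keep every factor-th pixel; nothing is filtered, so aliasing occurs."""
    return img[::factor, ::factor]

def downsample_box(img, factor):
    """Average each factor x factor block (a simple antialiasing filter)."""
    h, w = img.shape
    blocks = img.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

ring = make_ring()
naive = downsample_nearest(ring, 8)   # ring breaks into a few isolated dots
averaged = downsample_box(ring, 8)    # ring stays connected, only blurred
```

Block averaging keeps the overall brightness of the image, while the naive version keeps only the few ring pixels that happen to fall on the sampling grid.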
Effect of the resize function on FID
We want to isolate the effect of resizing, so both inputs are real images and their distributions are identical. The results (table on the right) show what happens when only the resize step is swapped between libraries. (Ideally, since the images are the same, the FIDs would match for all libraries.)
Naturally, the entry for the PIL-bicubic pair is "-", since it is compared with itself. Although there are small deviations, the FID is essentially correct for the PIL variants other than PIL-box. The results of the other libraries, however, are clearly bad. In other words, the accuracy differs depending on which library is used. The StyleGAN2 results on the right use actual generated images, and there too the values for libraries other than PIL show a large error.
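As a rough illustration of the PIL-bicubic setting used as the reference here, the resize to 299 x 299 might be written as follows. This is a minimal sketch and not the authors' code: the function name resize_for_inception is hypothetical, and it assumes that Pillow and NumPy are installed and that images are HxWx3 uint8 arrays.

```python
import numpy as np
from PIL import Image

def resize_for_inception(img_uint8, size=299):
    """Resize an HxWx3 uint8 array to size x size with PIL's bicubic filter."""
    pil_img = Image.fromarray(img_uint8)
    # When shrinking, PIL widens the bicubic filter support with the scale
    # factor, which is the low-pass (antialiasing) behavior discussed above.
    resized = pil_img.resize((size, size), resample=Image.BICUBIC)
    return np.asarray(resized)
```

The important point is that the same resize function is applied to both the real and the generated images before they are passed to the feature extractor.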
Effect of the image storage format on FID
The amount of information in an image naturally differs depending on whether it is saved as PNG or JPEG, so the authors suspected that the storage format might also affect FID and examined this as well. The resizing is fixed to PIL-bicubic, and the input images are deliberately compressed to JPEG.
Looking at the results, the lower the JPEG quality setting (the number in JPEG-XX), the worse the FID. The same holds for images generated by StyleGAN2, and PSNR does not degrade as much as FID does. We can at least say that JPEG compression has an effect. I think the point here is not that compression is bad, but that FID is affected by the storage format, so you basically have to match the storage format to that of the dataset you are using. I will briefly explain below.
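For reference, PSNR, the pixel-level measure that FID is compared with here, can be computed in a few lines. This is a minimal sketch using the standard definition for 8-bit images; the function name and the max_value default are assumptions for this example, not taken from the paper.

```python
import numpy as np

def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    dist = np.asarray(distorted, dtype=np.float64)
    mse = np.mean((ref - dist) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

PSNR only looks at per-pixel differences, so it is plausible that it reacts differently to compression than a feature-based metric such as FID.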
To begin with, the face image dataset is FFHQ, which is stored in PNG format, so it is understandable that it is affected by JPEG compression. Also, a generated image consists of continuous values (float) at generation time, but it must be converted to discrete values (int) when it is saved. For this reason, the results of feeding the float values directly into Inception are also shown, and the effect turns out to be small.
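The float-to-int conversion mentioned above can be sketched as follows. This assumes the generator output lies in [-1, 1] and is mapped linearly to [0, 255], which is a common convention but an assumption here; the function names are hypothetical.

```python
import numpy as np

def to_uint8(img_float):
    """Quantize a float image in [-1, 1] to uint8 in [0, 255]."""
    scaled = (img_float + 1.0) * 127.5
    return np.clip(np.round(scaled), 0, 255).astype(np.uint8)

def to_float(img_uint8):
    """Map a uint8 image back to floats in [-1, 1]."""
    return img_uint8.astype(np.float64) / 127.5 - 1.0
```

Under this mapping the round trip changes each value by at most 1/255, which is a tiny perturbation compared with what JPEG compression introduces.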
The study also considers another dataset, LSUN Church, which is stored as JPEG-75. Looking at the results, the trend differs from the above: for some reason, the results are better with some noise. The important point, however, is that FID is still affected by the storage format.
Based on these results, the authors recommend the use of PNG. PNG may well become the recommended format for future CV research. Also, while the discretization of image values has some impact, it is small, as with FFHQ.
The effect of the save format on FID is not as large as the effect of the library. Therefore, the save format is fixed to PNG and the FID is measured again while changing the library. The comparison covers PIL-bicubic, which resizes correctly, the official TensorFlow implementation, and an unofficial PyTorch implementation. The numbers in parentheses represent the difference from the smallest value.
Interestingly, the PyTorch results had the smallest FID in all cases. In other words, calculating FID with PyTorch is unfair, because it gives a more favorable number. More interestingly, the difference between the models is the same regardless of which implementation is used. So the comparison between models is fine as long as the same library implementation is used throughout.
Furthermore, if we compare FID-correct (PIL-bicubic) with FID-buggy (PyTorch), we can see that the FID of the PyTorch implementation is smaller at every training step. (However, FID-correct and FID-buggy are not perfectly correlated.)
Results by Dataset
We now repeat the previous experiments on ImageNet and look at FID on general data.
In this result, FID did not degrade as much as on FFHQ. It is possible that ImageNet, with its lower resolution, is less affected by the resizing. In fact, FFHQ is 1024x1024, so resizing reduces it to less than half its resolution, whereas ImageNet images are already close to the target resolution.
Also, looking at the effect of the downsampling ratio, it is clear that the larger the ratio, the greater the effect on FID.
From this study, I learned that FID is affected by incorrect resize implementations in libraries. However, since the effect in the ImageNet experiment is not extreme, I think the real cause is the aliasing artifacts described in the background section at the beginning. Of course, the errors in the library implementations need to be fixed, but what we should pay attention to in the future is this: unless we check whether our processing unknowingly introduces artifacts, research will keep moving forward without anyone noticing them.
Also, there are metrics like LPIPS that do not resize, so I would like to pay attention to those too. I mention this because one of the authors, Richard Zhang, is the person who proposed LPIPS.