Torch.manual_seed(3407) Is All You Need
3 main points
✔️ How model accuracy depends on the manual seed used for model initialization.
✔️ Find out if there are seeds that yield radically better results.
✔️ Whether using pre-trained models mitigates the variability induced by the choice of seed.
Torch.manual_seed(3407) is all you need: On the influence of random seeds in deep learning architectures for computer vision
written by David Picard
(Submitted on 16 Sep 2021)
Comments: Published on arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Deep learning models are sensitive to the initial state of the model. The same architecture trained on the same data with the same training procedure, but from different initial states, can produce two final models with a measurable difference in performance. But how much improvement can one expect from a "lucky" initial state (manual seed) over other initial states? Are there factors that mitigate the variability induced by the choice of the manual seed? This paper tries to answer these questions.
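In PyTorch, `torch.manual_seed(3407)` pins down the random initial state, so the same seed always reproduces the same initial weights. The helper below is a hypothetical stand-in for weight initialization, using Python's `random` module rather than an actual network, just to make the seed-to-initial-state dependence concrete:

```python
import random

def init_weights(seed, n=5):
    """Hypothetical stand-in for model initialization: the seed
    fully determines the 'initial weights' that are drawn."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]

# Same seed -> identical initial state, hence an identical training run.
assert init_weights(3407) == init_weights(3407)

# A different seed gives a different initial state, and potentially
# a measurably different final model.
assert init_weights(3407) != init_weights(42)
```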
Significance and Methodology
Today, new papers reporting incremental progress in architecture design and training are released on a regular basis. If some particular seeds indeed produce radically better results than others, researchers would need to validate their work by running experiments with several seeds, so that only genuinely valuable research attracts the attention of the community.
In most cases, researchers with access to powerful scientific clusters can afford to conduct these experiments. To show that the cost is modest, we limit ourselves to 1,000 hours of V100 GPU compute, which is much less than the compute available in most scientific clusters. The following table summarizes the experiments we conducted.
We first examine the distribution of validation accuracy on CIFAR-10 of 500 different seeds.
The left diagram shows how accuracy varies over training epochs. As we can see, the validation accuracy saturates after about 25 epochs, but the curves for models trained with different seeds do not converge to a single curve even with further training up to 40 epochs. The right diagram is a histogram/density plot of the final validation accuracies. The sharply peaked center of the distribution suggests that putting in some effort and selecting the final model after trying a few different seeds is likely to be representative of what the architecture can achieve. However, since the density between 90.5% and 91% is still quite high, a randomly selected seed could easily cause a performance decrease (or increase) of about 0.5% in this case.
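The seed-scanning procedure behind these plots is simple to express. Below is a minimal simulation of it: `train_and_eval` is a hypothetical stand-in that replaces a real training run with a seed-dependent draw whose center and spread are assumed values chosen to roughly resemble the reported CIFAR-10 distribution, not the paper's actual numbers:

```python
import random
import statistics

def train_and_eval(seed):
    """Hypothetical stand-in for a full CIFAR-10 training run:
    the returned 'validation accuracy' depends only on the seed."""
    rng = random.Random(seed)
    return 90.7 + rng.gauss(0.0, 0.25)  # assumed centre and spread

# Scan 500 seeds, as in the first experiment, and summarize the spread.
accuracies = [train_and_eval(seed) for seed in range(500)]
print(f"mean={statistics.mean(accuracies):.2f}% "
      f"std={statistics.pstdev(accuracies):.2f}% "
      f"min={min(accuracies):.2f}% max={max(accuracies):.2f}%")
```

In a real study, `train_and_eval` would call `torch.manual_seed(seed)`, train the model, and return the measured validation accuracy; the surrounding scan and summary would be unchanged.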
Looking for the Black Swan
Next, we examine the results of all 10,000 CIFAR-10 experiments to see whether some manual seeds yield radically good (or bad) results. The minimum accuracy we found was 89.01% and the maximum 90.83%, a difference of 1.82%. Such a gap is considered significant by today's score-driven computer vision community and, in many cases, would justify publishing a new paper.
Of course, it is rather unlikely that such extremes could be reached without scanning a significant portion of seeds. Conversely, the gap between the extremes of a sample larger than 10,000 seeds could be even wider.
Large Scale Dataset
Next, we examine whether using a pre-trained model reduces the randomness induced by the choice of seed. The standard deviation was about 0.1% and the gap between the minimum and maximum accuracies was about 0.5%. This is much less than in the CIFAR-10 case, but we have to keep in mind that every layer except the last linear layer shared the same weights across all models, and the training data and procedure were also exactly the same.
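One way to read this comparison is as two seed studies with very different spreads. The simulation below bakes the spreads in as assumptions (roughly 0.25% std when training from scratch, and the roughly 0.1% std reported above for the pre-trained case; both mean accuracies are hypothetical), purely to illustrate how the two settings would be tabulated side by side:

```python
import random
import statistics

def run(seed, pretrained):
    """Hypothetical stand-in for one training run. With a pre-trained
    backbone, only the final linear layer depends on the seed, so the
    seed-induced noise is assumed smaller (std 0.1% vs 0.25%)."""
    rng = random.Random(seed + (100_000 if pretrained else 0))
    base, noise = (75.5, 0.10) if pretrained else (90.7, 0.25)  # assumed means
    return base + rng.gauss(0.0, noise)

# 50 seeds per setting, matching the sample size discussed in the text.
scratch = [run(s, pretrained=False) for s in range(50)]
frozen = [run(s, pretrained=True) for s in range(50)]
for name, accs in [("from scratch", scratch), ("pre-trained", frozen)]:
    print(f"{name}: std={statistics.pstdev(accs):.2f}% "
          f"gap={max(accs) - min(accs):.2f}%")
```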
However, the small sample size of 50 casts some doubt on the validity of these results. Moreover, the distribution of the 50 samples does not rule out the gap growing beyond 1% with a larger sample size. Nevertheless, this variation of 0.5% is still considered significant in the computer vision community.
It is quite clear that the choice of manual seed is significant. Deep learning research is moving forward rapidly, and researchers are in a rush to publish their work. There are more than 10^4 submissions to major computer vision conferences each year. Surely the submitted models are not all the same, but it is very likely that the majority of them did not put much effort into ensuring that their results were not simply due to a lucky setup.
To address this problem, the research community should embrace some standard form of randomness study: varying seeds (and perhaps dataset splits as well) and reporting the results. Future work could expand on this study by increasing the sample size and training the models longer, on other datasets and model architectures.
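The randomness study advocated here reduces to a small reporting helper. A minimal sketch (the `seed_study` name is hypothetical; `run_fn` stands in for any train-and-evaluate routine that accepts a seed):

```python
import statistics

def seed_study(run_fn, seeds):
    """Run the same experiment under several seeds and report
    mean and std, as a minimal form of the randomness study the
    author advocates. `run_fn(seed)` is any training/eval routine."""
    scores = [run_fn(seed) for seed in seeds]
    return statistics.mean(scores), statistics.pstdev(scores)

# Toy usage with a deterministic stand-in experiment:
mean, std = seed_study(lambda s: 90.0 + (s % 7) * 0.1, seeds=range(10))
print(f"accuracy = {mean:.2f} ± {std:.2f}")
```

Reporting the pair (mean ± std over seeds) rather than a single best run is the concrete practice the conclusion argues for.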