ResNets Learning And Scaling Strategy For SOTA Performance!
3 main points
✔️ A set of training and scaling strategies to improve the performance of ResNets (and EfficientNets).
✔️ Introduces ResNet-RS, a family of models that is up to 3x faster than EfficientNets.
✔️ Impressive performance on semi-supervised, transfer-learning, and video classification tasks.
Revisiting ResNets: Improved Training and Scaling Strategies
Written by Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, Barret Zoph
(Submitted on 13 Mar 2021)
Comments: Accepted to arXiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Introduction
When it was first introduced, the ResNet architecture greatly advanced the SOTA in several computer vision tasks. Although roughly five years have passed since then, ResNet and its variants are still widely used in research and practical deployment.
Current research mostly focuses on architectural changes, even though new models often also benefit from superior training methods and hyperparameters. These new models are often compared with older architectures trained using outdated methods (for example, a ResNet with 76.5% top-1 accuracy on ImageNet).
We trained a canonical ResNet model using current SOTA training methods and increased the top-1 ImageNet accuracy from 79.0% to 82.2%; minor architectural changes improved it further to 83.4%. Using these strategies, we also introduce ResNet-RS, a family of ResNet architectures that is faster than EfficientNets (up to 2.7x on TPU and 3.3x on GPU). The same strategies also help EfficientNets, and they improve top-1 Kinetics-400 video classification accuracy by 4.8% over the baseline.
Modifications to ResNet
We make some architectural and training changes to the original ResNet, which are described below.
We use the ResNet-D modification and the Squeeze-and-Excitation (SE) modification in all bottleneck models. ResNet-D makes three changes to the ResNet architecture: 1) the 7×7 convolution in the stem is replaced by three 3×3 convolutions, 2) the strides of the first two convolutions in the downsampling blocks are switched, and 3) the stride-2 1×1 convolution in the skip connection of the downsampling blocks is replaced by stride-2 2×2 average pooling followed by a non-strided 1×1 convolution. In addition, the stride-2 3×3 max pool layer that follows the stem is removed, and the downsampling is done instead in the first 3×3 convolution of the next ResNet block.
The SE layer scales the channels by first globally average pooling the feature map of a convolutional block and computing cross-channel interactions so that the network can adaptively adjust the weighting of each feature map. For all experiments, we use a ratio of 0.25.
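To make the squeeze, excite, and scale steps concrete, here is a minimal pure-Python sketch of an SE gate. It is only an illustration under stated assumptions: the function name `squeeze_excite`, the list-based tensor layout, the omission of biases, and the toy weights are all hypothetical and are not taken from the paper's implementation.

```python
import math

def squeeze_excite(feature_map, w_reduce, w_expand):
    """Apply a squeeze-and-excitation gate to a C x H x W feature map.

    feature_map: list of C channels, each a list of H rows of W floats.
    w_reduce:    R x C weight matrix (R = C * ratio, e.g. ratio = 0.25).
    w_expand:    C x R weight matrix.
    Biases are omitted to keep the sketch short (an assumption).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]

    # Excite: bottleneck FC -> ReLU -> FC -> sigmoid models cross-channel interactions.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]

    # Scale: reweight every channel by its gate, which lies in (0, 1).
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature_map)]


# Toy example: 4 channels of 2x2 activations; ratio 0.25 gives 1 hidden unit.
channels = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
w_reduce = [[0.1, 0.1, 0.1, 0.1]]          # 1 x 4, hypothetical weights
w_expand = [[1.0], [0.0], [-1.0], [2.0]]   # 4 x 1, hypothetical weights
scaled = squeeze_excite(channels, w_reduce, w_expand)
```

With a ratio of 0.25, the hidden layer has a quarter as many units as there are channels, so the gate adds few parameters relative to the convolutions it modulates.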
Our training method closely matches that of EfficientNets, with a few changes. We train for 350 epochs using RandAugment (translate, shear, color distortions), a momentum optimizer, and cosine learning-rate scheduling. We also apply weight decay, label smoothing, dropout, and stochastic depth for regularization.
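Two of these ingredients are easy to write down. The sketch below shows one common form of a cosine learning-rate schedule and of label smoothing. The base learning rate of 0.1, the smoothing value of 0.1, and the function names are assumptions chosen for illustration; only the 350-epoch budget comes from the text, and any warmup phase is left out.

```python
import math

def cosine_lr(epoch, total_epochs=350, base_lr=0.1, final_lr=0.0):
    """Cosine-annealed learning rate for a (possibly fractional) epoch index.

    base_lr and final_lr are hypothetical values, not the paper's settings.
    """
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

def smooth_labels(target, num_classes, smoothing=0.1):
    """Return a smoothed one-hot distribution for the given target class index."""
    off = smoothing / num_classes
    return [off + ((1.0 - smoothing) if k == target else 0.0) for k in range(num_classes)]

# The rate starts at base_lr, is halved at the midpoint, and decays towards final_lr.
schedule = [cosine_lr(e) for e in (0, 175, 350)]
soft_target = smooth_labels(target=2, num_classes=4)
```

The cosine schedule removes the need to choose step boundaries by hand, which is one reason it is convenient when the number of training epochs is varied.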
We trained a baseline ResNet-200 model, which reaches 79.0% top-1 accuracy. Improved training methods (highlighted in purple and green) raised this to 82.2%, and the SE and ResNet-D architectural changes boosted it further to 83.4%. The training methods alone account for roughly three-quarters of the total improvement (3.2 of 4.4 percentage points), which shows how strongly they influence ImageNet performance.
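The split between training and architecture gains can be checked directly from the three accuracies quoted above. The snippet is only a worked calculation; the variable names are ours.

```python
# Top-1 ImageNet accuracies (%) quoted in the text for ResNet-200.
baseline, after_training, after_architecture = 79.0, 82.2, 83.4

training_gain = after_training - baseline                  # improved training methods
architecture_gain = after_architecture - after_training    # SE + ResNet-D
training_share = training_gain / (training_gain + architecture_gain)

print(f"training: +{training_gain:.1f}, architecture: +{architecture_gain:.1f}, "
      f"share from training: {training_share:.0%}")
```

Training methods contribute 3.2 of the 4.4 points gained, or about 73% of the total.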
We also found it necessary to reduce weight decay when combining it with other regularization such as dropout (DO), stochastic depth (SD), label smoothing (LS), and RandAugment (RA). There is evidence that data augmentation shrinks the L2 norm of the weights just as weight decay does, making the effect of weight decay redundant.
We tested the ResNet model on ImageNet using width multipliers of [0.25, 0.5, 1.0, 1.5, 2.0], depths of [26, 50, 101, 200, 300, 350, 400], and image resolutions of [128, 160, 224, 320, 448]. All models were trained for 350 epochs. In the lower-FLOPs regime (up to 10^9), the error decreases as a power law as FLOPs increase. In the higher-FLOPs regime the trend breaks, and increasing the FLOPs can even be detrimental in some cases.
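A power law of the form error = a * FLOPs^(-b) is a straight line in log-log space, so it can be fitted with ordinary least squares. The sketch below shows this on synthetic points generated from a known power law; the data, the constants 5.0 and 0.12, and the function name `fit_power_law` are made up for illustration and are not results from the paper.

```python
import math

def fit_power_law(flops, errors):
    """Least-squares fit of error = a * flops**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(f) for f in flops]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic points from a known power law (NOT measurements from the paper).
flops = [1e7, 3e7, 1e8, 3e8, 1e9]
errors = [5.0 * f ** -0.12 for f in flops]
a, b = fit_power_law(flops, errors)
```

On real measurements, points in the higher-FLOPs regime would be expected to fall off this line, which is how the break in the trend shows up.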
The above diagram shows depth scaling and width scaling across image resolutions of [128, 160, 224, 320] for 10, 100, and 350 epochs. Here, models are trained at four depths [101, 200, 300, 400] and three width multipliers [1.0x, 1.5x, 2.0x]. The best-performing scaling strategy depends on the training regime. As the rightmost figure shows, depth scaling is more beneficial than width scaling when training for many epochs (350), whereas width scaling is more beneficial in the low-epoch regime. Therefore, the common practice of deriving scaling rules in small-scale regimes should be avoided, because those rules may not generalize to larger models or longer training. Instead, it is necessary to test a small subset of models across different scales, for the full number of training epochs, to understand the optimal scaling strategy.
We also found that larger image resolutions can be detrimental for smaller models. It is therefore advisable to scale image resolution more gradually than previous models such as EfficientNet do.
Experiment and Evaluation
Using the training and design strategies discussed above, we trained a family of ResNets called ResNet-RS and conducted several studies on them.
Speed-accuracy of ResNet-RS and EfficientNets
Although ResNet-RS models have a larger parameter count and more FLOPs than their EfficientNet counterparts, they are 1.7x to 2.7x faster on TPU. FLOPs carry no information about memory access cost (MAC) or the degree of parallelism, both of which are crucial in determining the speed of a model. Multi-branch modules consist of many fragmented operations that do not map well onto modern parallel computing devices such as GPUs and TPUs. EfficientNet also consumes more memory because of its larger number of activations. For example, a ResNet-RS model with 3.8x more parameters than EfficientNet-B6 consumes 2.3x less memory for similar ImageNet accuracy.
Semi-supervised learning with ResNet-RS
ResNet-RS was trained on a combination of 1.2M labeled ImageNet images and 130M pseudo-labeled images, where an EfficientNet-L2 model with 88.4% ImageNet accuracy was used to generate the pseudo labels.
We found that ResNet-RS models are very good semi-supervised learners. While achieving better top-1 accuracy than EfficientNets, they are about 5 times faster.
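The data pipeline behind this setup can be sketched in a few lines: a teacher model labels the unlabeled pool, and the result is merged with the human-labeled set. This is a simplified illustration; the function names, the toy teacher, and the use of hard (argmax) pseudo labels are assumptions of the sketch and are not details confirmed by the article.

```python
def build_semi_supervised_set(labeled, unlabeled, teacher):
    """Merge human-labeled examples with teacher-generated pseudo labels.

    labeled:   list of (example, label) pairs.
    unlabeled: list of examples without labels.
    teacher:   callable returning a list of class probabilities for one example.
    """
    pseudo = []
    for example in unlabeled:
        probs = teacher(example)
        # Hard pseudo label: the teacher's most probable class (an assumption here).
        pseudo.append((example, max(range(len(probs)), key=probs.__getitem__)))
    return labeled + pseudo

# Hypothetical stand-in teacher; in the article this role is played by EfficientNet-L2.
def toy_teacher(example):
    return [0.9, 0.1] if example < 0 else [0.2, 0.8]

train_set = build_semi_supervised_set(
    labeled=[(-1.0, 0), (2.0, 1)],
    unlabeled=[-3.0, 0.5, 4.0],
    teacher=toy_teacher,
)
```

The student model is then trained on the merged set exactly as it would be on labeled data alone.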
Transfer Learning with ResNet-RS
We compare the transfer performance of self-supervised SimCLR and SimCLRv2 with standard supervised ResNets and with our improved supervised training strategies (RS). To match SimCLR's training setup as closely as possible, we use RandAugment, label smoothing, dropout, decreased weight decay, and cosine learning-rate decay for 400 epochs, but we do not use stochastic depth or an exponential moving average (EMA) of the weights.
Our improved supervised representations (RS) outperform SimCLR on 5/10 of the downstream tasks and SimCLRv2 on 8/10 tasks.
Extension to video classification
Our scaling and training strategies can also be used for video tasks. The training strategies improve the baseline from 73.4% to 77.4% (+4.0%). The ResNet-D and Squeeze-and-Excitation architectural changes further improve the performance to 78.2%.
Simple strategies like the ones introduced in this paper can yield significant performance improvements across various tasks. The paper shows that it is essential for research work to distinguish between improvements that come from architectural changes and those that come from training methods. Improvements from training methods do not always generalize well, and combining the two kinds of change makes model comparisons difficult. Moreover, in addition to parameter count and FLOPs, it is important to report the latency and memory consumption of models. Adopting these practices should help research advance faster.