Very Deep Convolutional Networks For Large-scale Image Recognition

Image Recognition 28/12/2023

3 main points
✔️ An architecture with very small (3 × 3) convolutional filters was used to thoroughly evaluate networks of increasing depth.
✔️ These findings formed the basis for the authors' submission to the ImageNet Challenge 2014, where their team secured first and second place in the localization and classification tracks, respectively.
✔️ The authors demonstrated that cutting-edge performance can be achieved on the ImageNet Challenge dataset using the traditional ConvNet architecture, with a significant increase in depth.

Very Deep Convolutional Networks for Large-Scale Image Recognition
written by Karen Simonyan, Andrew Zisserman
(Submitted on 4 Sep 2014 (v1), last revised 10 Apr 2015 (this version, v6))
Comments: Published on arxiv.
Subjects: Computer Vision and Pattern Recognition (cs.CV)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

This study investigates how the depth of a convolutional network affects its accuracy in large-scale image recognition. Using an architecture with very small (3 × 3) convolutional filters, the authors show that pushing the depth to 16 to 19 weight layers gives a significant improvement over prior configurations. These results were the basis of the team's success in the 2014 ImageNet Challenge, and the learned representations also generalize well to other datasets. The authors have made their two best-performing ConvNet models publicly available to facilitate further research on deep visual representations.

Introduction

Convolutional networks (ConvNets) have recently been very successful in large-scale image recognition, thanks to large public image datasets such as ImageNet and to high-performance computing systems such as GPUs. The ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in particular has served as a testbed for successive generations of image classification systems. As ConvNets became more popular, many attempts were made to improve on the original architecture. This study focuses on one design factor, depth: the authors fix the other parameters of the architecture and steadily add convolutional layers, which is feasible because every layer uses very small (3 × 3) filters. The resulting networks achieve state-of-the-art accuracy on ILSVRC and also transfer well to other datasets. The two best-performing models have been released to the public to support further research.

Architecture

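During training, the input to the ConvNet is a fixed-size 224 × 224 RGB image, and the only preprocessing is to subtract the mean RGB value, computed on the training set, from each pixel. The convolutional layers use small 3 × 3 filters with a stride of 1 pixel, and the input to each layer is padded by 1 pixel so that the spatial resolution is preserved. Spatial pooling is carried out by five max-pooling layers (2 × 2 window, stride 2) that follow some of the convolutional layers. The stack of convolutional layers is followed by three fully connected layers: the first two have 4096 channels each, and the third has 1000 channels, one per ILSVRC class. The final layer is a softmax layer. All hidden layers use the ReLU nonlinearity, and with the exception of one configuration the networks do not include local response normalization (LRN).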
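
As a rough check on the layer arithmetic, the sketch below traces the spatial size of the feature map through the network. It assumes, following the paper, 1-pixel padding for the 3 × 3 convolutions and five 2 × 2 max-pooling layers with stride 2. The helper names `conv_out` and `pool_out` are illustrative and do not come from any library.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output size of max-pooling along one spatial dimension."""
    return (size - window) // stride + 1


size = 224
sizes = [size]
for _ in range(5):          # five conv blocks, each followed by a max-pool
    size = conv_out(size)   # a padded 3x3 convolution keeps the resolution
    size = pool_out(size)   # the pooling layer halves it
    sizes.append(size)

print(sizes)  # [224, 112, 56, 28, 14, 7]
```

The final 7 × 7 map with 512 channels is what the first fully connected layer receives.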

Configuration

The paper evaluates ConvNet configurations named A through E (plus A-LRN, a variant of A with local response normalization). They all follow the same general design and differ only in depth, from 11 weight layers in A (8 convolutional and 3 fully connected) to 19 weight layers in E (16 convolutional and 3 fully connected). The width of the convolutional layers (the number of channels) is rather small, starting at 64 in the first layer and doubling after each max-pooling layer until it reaches 512.

Table 2 reports the number of parameters for each configuration. Despite the greater depth, the number of weights in these networks is no greater than in a shallower network with wider convolutional layers and larger receptive fields.
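
The parameter counts can be reproduced approximately with a short script. The sketch below encodes the layer lists as they appear in Table 1 of the paper and assumes a 224 × 224 input (so the last feature map is 7 × 7) and that biases are counted. The names `CONFIGS`, `weight_layers`, and `count_params` are illustrative.

```python
# Each block is a list of (kernel_size, out_channels); a 2x2 max-pool follows every block.
CONFIGS = {
    "A": [[(3, 64)], [(3, 128)], [(3, 256)] * 2, [(3, 512)] * 2, [(3, 512)] * 2],
    "B": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 2, [(3, 512)] * 2, [(3, 512)] * 2],
    "C": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 2 + [(1, 256)],
          [(3, 512)] * 2 + [(1, 512)], [(3, 512)] * 2 + [(1, 512)]],
    "D": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 3, [(3, 512)] * 3, [(3, 512)] * 3],
    "E": [[(3, 64)] * 2, [(3, 128)] * 2, [(3, 256)] * 4, [(3, 512)] * 4, [(3, 512)] * 4],
}
FC_SIZES = [4096, 4096, 1000]  # three fully connected layers


def weight_layers(name):
    """Number of layers with learnable weights (convolutional + fully connected)."""
    return sum(len(block) for block in CONFIGS[name]) + len(FC_SIZES)


def count_params(name, in_channels=3, final_map=7):
    """Total weights and biases (assumes a 224x224 input, hence a 7x7 final map)."""
    total, c_in = 0, in_channels
    for block in CONFIGS[name]:
        for kernel, c_out in block:
            total += kernel * kernel * c_in * c_out + c_out
            c_in = c_out
    n_in = c_in * final_map * final_map
    for n_out in FC_SIZES:
        total += n_in * n_out + n_out
        n_in = n_out
    return total


for name in CONFIGS:
    print(name, weight_layers(name), round(count_params(name) / 1e6))
# A 11 133
# B 13 133
# C 16 134
# D 16 138
# E 19 144
```

Most of the parameters sit in the first fully connected layer, which is why adding convolutional layers changes the total only modestly.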

Discussion

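Instead of the relatively large receptive fields used in the first layers of earlier top-performing ConvNets, this architecture uses very small 3 × 3 receptive fields throughout the network. A stack of three 3 × 3 convolutional layers has the same effective receptive field as a single 7 × 7 layer, but it contains three nonlinear rectification layers instead of one, which makes the decision function more discriminative, and it needs fewer parameters: 27C² weights instead of 49C² for C input and output channels. The 1 × 1 convolutions in configuration C are a way to add nonlinearity to the decision function without affecting the receptive fields of the convolutional layers. In the experiments, this design let deeper networks reach higher accuracy than earlier, shallower architectures.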
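
The trade-off between stacked small filters and one large filter can be checked numerically. This is a minimal sketch that ignores biases and assumes stride-1 convolutions with the same number of channels C at the input and output of every layer; the function names are illustrative.

```python
def receptive_field(kernel, layers):
    """Effective receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


def stacked_weights(kernel, layers, channels):
    """Weights in a stack of conv layers with `channels` inputs and outputs (no biases)."""
    return layers * kernel * kernel * channels * channels


C = 256
print(receptive_field(3, 3), receptive_field(7, 1))        # 7 7
print(stacked_weights(3, 3, C), stacked_weights(7, 1, C))  # 1769472 3211264
```

For the same 7 × 7 receptive field, the three-layer stack uses 27/49 of the weights, roughly 45% fewer, while applying three nonlinearities instead of one.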

Classification Framework

Training

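The ConvNets were trained with mini-batch gradient descent with momentum, using a batch size of 256 and a momentum of 0.9. Training was regularized by weight decay (L2 penalty multiplier of 0.0005) and by dropout (ratio 0.5) on the first two fully connected layers. The learning rate was initially set to 0.01 and then decreased by a factor of 10 whenever the validation accuracy stopped improving. Because poor initialization can stall learning in deep networks, the shallow configuration A was trained first from random initialization; the deeper architectures were then initialized with the first four convolutional layers and the last three fully connected layers of net A, with the remaining layers initialized randomly. To augment the training set, the 224 × 224 inputs were randomly cropped from rescaled training images, with random horizontal flips and random RGB color shifts.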
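
For illustration, one update step of gradient descent with momentum and weight decay might look like the sketch below. It assumes the common formulation in which the L2 penalty is added to the gradient; this summary does not spell out the exact update rule, so treat it as an assumption. The momentum of 0.9 comes from the text, while the learning rate of 0.01 and the decay of 0.0005 are the values reported in the paper. `sgd_momentum_step` is a hypothetical helper operating on plain Python lists.

```python
def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=5e-4):
    """Return updated (weights, velocity) after one momentum-SGD step."""
    v_new = [momentum * vi - lr * (gi + weight_decay * wi)
             for wi, vi, gi in zip(w, v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


# Toy objective f(w) = 0.5 * w^2, whose gradient is simply w.
w, v = [1.0], [0.0]
for _ in range(3):
    grad = list(w)
    w, v = sgd_momentum_step(w, v, grad)
print(w)  # the weight moves toward the minimum at 0
```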

Image Size

Let S denote the smallest side of the isotropically rescaled training image, from which the ConvNet input is cropped. The authors consider two ways of setting S. The first is single-scale training with fixed S; models were trained at two fixed scales, S = 256 and S = 384. The second is multi-scale training, in which each training image is individually rescaled by randomly sampling S from a range [S_min, S_max] (with S_min = 256 and S_max = 512), so that a single model learns to recognize objects over a wide range of scales. For speed, the multi-scale models were trained by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384.
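
The geometry of scale jittering can be sketched without any image library. The code below only computes sizes and crop coordinates; it assumes the paper's sampling range of 256 to 512 for S and a 224 × 224 crop, and all function names are illustrative.

```python
import random


def rescaled_size(width, height, s):
    """Isotropically rescale so that the smallest side equals s."""
    scale = s / min(width, height)
    return round(width * scale), round(height * scale)


def sample_train_scale(rng, s_min=256, s_max=512):
    """Multi-scale training: draw S uniformly from [s_min, s_max]."""
    return rng.randint(s_min, s_max)


def random_crop_box(width, height, rng, crop=224):
    """Pick a random crop as (left, top, right, bottom)."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


rng = random.Random(0)
s = sample_train_scale(rng)
w, h = rescaled_size(500, 375, s)   # a 500x375 training image
box = random_crop_box(w, h, rng)
print(s, (w, h), box)
```

With a fixed scale, `sample_train_scale` would simply be replaced by a constant such as 256 or 384.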

Test

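At test time, the input image is isotropically rescaled to a predefined smallest side Q (not necessarily equal to the training scale S), and the network is applied densely over the rescaled image. To do this, the fully connected layers are first converted to convolutional layers, and the resulting fully convolutional network is applied to the whole uncropped image. This yields a class score map whose spatial resolution depends on the input size; the map is spatially averaged to obtain a fixed-size vector of class scores. The test set is also augmented by horizontal flipping, and the softmax class posteriors of the original and flipped images are averaged to give the final scores. Because the fully convolutional network is applied to the whole image, there is no need to sample multiple crops and recompute the network for each one, which makes testing more efficient. The authors believe that in practice the extra computation time of multiple crops does not justify the potential accuracy gain, but they also evaluate with multiple crops for reference.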
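
To make dense evaluation concrete, the sketch below estimates the size of the class score map and averages it over positions. It assumes a total stride of 32 from the five pooling layers and that the first fully connected layer becomes an unpadded 7 × 7 convolution; the padding choice is an assumption, and both function names are illustrative.

```python
def score_map_size(height, width, fc_kernel=7, total_stride=32):
    """Spatial size of the class score map for a given input size (assumed geometry)."""
    h = height // total_stride - fc_kernel + 1
    w = width // total_stride - fc_kernel + 1
    return h, w


def spatial_average(score_map):
    """score_map[c] is a 2-D grid (list of rows) of scores for class c."""
    averaged = []
    for grid in score_map:
        values = [v for row in grid for v in row]
        averaged.append(sum(values) / len(values))
    return averaged


print(score_map_size(224, 224))  # (1, 1)
print(score_map_size(384, 512))  # (6, 10)

toy_map = [[[1.0, 2.0], [3.0, 4.0]],   # class 0
           [[0.0, 0.0], [0.0, 4.0]]]   # class 1
print(spatial_average(toy_map))  # [2.5, 1.0]
```

A 224 × 224 input gives a single score vector, while larger inputs give a grid of scores that is averaged into one vector per image.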

Implementation Details

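The implementation is derived from the publicly available C++ Caffe toolbox, with modifications that allow training and evaluation on multiple GPUs installed in a single system. Multi-GPU training uses data parallelism: each batch of training images is split into several GPU batches that are processed in parallel, and the per-GPU gradients are averaged to obtain the gradient of the full batch. Because the gradient computation is synchronous across GPUs, the result is exactly the same as when training on a single GPU. On an off-the-shelf 4-GPU system this gave a speedup of 3.75 times over a single GPU. On a system with four NVIDIA Titan Black GPUs, training a single network took 2 to 3 weeks, depending on the architecture.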
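
The idea behind synchronous data parallelism can be shown on a toy problem. The sketch below is not the authors' Caffe code; it uses a hypothetical one-parameter least-squares model and simulates four "GPUs" as equal shards of a batch, to show that averaging per-shard gradients reproduces the full-batch gradient.

```python
def split_batch(batch, num_gpus):
    """Split a batch into equal shards (assumes len(batch) is divisible by num_gpus)."""
    size = len(batch) // num_gpus
    return [batch[i * size:(i + 1) * size] for i in range(num_gpus)]


def mean_gradient(samples, w):
    """Mean gradient of 0.5 * (w * x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)


def data_parallel_gradient(batch, w, num_gpus=4):
    """Average the gradients computed independently on each shard."""
    per_gpu = [mean_gradient(shard, w) for shard in split_batch(batch, num_gpus)]
    return sum(per_gpu) / len(per_gpu)


batch = [(i / 256, 2 * i / 256) for i in range(256)]  # toy data with y = 2x
w = 0.5
full = mean_gradient(batch, w)
sharded = data_parallel_gradient(batch, w)
print(abs(full - sharded) < 1e-9)  # True
```

The equality holds because the shards have equal size; with unequal shards the per-shard gradients would need to be weighted by shard size.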

Classification experiment

Dataset

This section presents the image classification results achieved by the ConvNet architectures on the ILSVRC-2012 dataset. The dataset contains images of 1000 classes and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). Classification performance is evaluated with two measures: the top-1 error, which is the proportion of incorrectly classified images, and the top-5 error, which is the proportion of images whose ground-truth class is not among the top five predicted classes.
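
Both measures can be computed with one small function. The sketch below uses made-up scores for three images and seven classes purely for illustration; `top_k_error` is a hypothetical helper, not part of the ILSVRC tooling.

```python
def top_k_error(scores, labels, k):
    """Fraction of images whose true label is not among the k highest-scoring classes."""
    errors = 0
    for image_scores, label in zip(scores, labels):
        ranked = sorted(range(len(image_scores)),
                        key=lambda c: image_scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Made-up class scores for three images (rows) and seven classes (columns).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.05, 0.05, 0.05],
    [0.30, 0.10, 0.25, 0.15, 0.10, 0.06, 0.04],
    [0.30, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05],
]
labels = [1, 2, 6]

print(top_k_error(scores, labels, k=1))  # 2 of 3 images are wrong at top-1
print(top_k_error(scores, labels, k=5))  # 1 of 3 images is wrong at top-5
```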

Single scale evaluation

First, the performance of the individual ConvNet models is evaluated at a single scale, using the layer configurations described in the previous section. The test image size Q was set as follows: Q = S for fixed S, and Q = 0.5(S_min + S_max) for jittered S ∈ [S_min, S_max]. The results are shown in Table 3.

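Comparing the configurations leads to several observations. First, local response normalization does not help: A-LRN does not improve on A, so normalization is not used in the deeper architectures. Second, the classification error decreases as the depth increases from 11 layers in A to 19 layers in E. Configuration C, which contains three 1 × 1 convolutional layers, performs better than B, so the additional nonlinearity helps, but it performs worse than D, which uses 3 × 3 layers throughout, so capturing spatial context with non-trivial receptive fields is also important. The error rate saturates at 19 layers, although even deeper models might be beneficial for larger datasets. A deep network with small filters also outperforms a shallow network with larger filters. Finally, scale jittering at training time gives significantly better results than training with a fixed smallest side, even though a single scale is used at test time, which confirms that augmenting the training set by scale jittering helps capture multi-scale image statistics.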

Multi-scale evaluation

Next, the authors assess the effect of scale jittering at test time. This consists of running a model over several rescaled versions of a test image (corresponding to different values of Q) and then averaging the resulting class posteriors. Because a large mismatch between training and test scales lowers performance, models trained with fixed S were evaluated at three test sizes close to the training one, Q = {S − 32, S, S + 32}. Models trained with scale jittering can be applied to a wider range of scales at test time, so they were evaluated at Q = {S_min, 0.5(S_min + S_max), S_max}.

The results show that scale jittering at test time improves performance over evaluating the same model at a single scale. As before, the deepest configurations (D and E) perform best, and scale jittering at training time is better than training with a fixed smallest side S.

Multi-crop evaluation

Table 5 compares dense ConvNet evaluation with multi-crop evaluation and also assesses the complementarity of the two techniques by averaging their softmax outputs. Using multiple crops performs slightly better than dense evaluation, and the combination of the two outperforms each of them individually. The authors hypothesize that this complementarity comes from the different treatment of convolution boundary conditions.

ConvNet fusion

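In this experiment, the outputs of several ConvNet models were combined by averaging their softmax class posteriors, which improves performance because the models are complementary. By the time of the ILSVRC submission, the authors had combined seven networks, giving an ILSVRC test error of 7.3%. After the submission, an ensemble of only the two best-performing multi-scale models (configurations D and E) reduced the test error to 6.8% using combined dense and multi-crop evaluation. For reference, the best-performing single model achieves an error of 7.1%.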
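
Fusion by posterior averaging is simple to sketch. The example below uses made-up logits for two hypothetical models and three classes; the function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(posteriors):
    """Average the class posteriors of several models."""
    n = len(posteriors)
    return [sum(p[c] for p in posteriors) / n for c in range(len(posteriors[0]))]


model_a = softmax([2.0, 1.0, 0.1])   # made-up outputs; model A prefers class 0
model_b = softmax([0.5, 2.5, 0.3])   # model B prefers class 1
fused = fuse([model_a, model_b])
prediction = max(range(len(fused)), key=lambda c: fused[c])
print(prediction)  # 1, because model B is more confident than model A
```

Averaging probabilities rather than hard votes lets a confident model outweigh an uncertain one, which is one reason complementary models help each other.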

Comparison with the state of the art

The authors' very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. In the classification task of ILSVRC-2014, the team took second place with a 7.3% test error using an ensemble of seven models; after the submission, the error was reduced to 6.8% using an ensemble of two models. This result is competitive with the classification task winner, GoogLeNet (6.7% error), and substantially better than the other submissions. Notably, the best result is achieved by combining just two models, significantly fewer than are used in most ILSVRC submissions.

Conclusion

This study evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It showed that representation depth is beneficial for classification accuracy and that state-of-the-art performance on the ImageNet Challenge dataset can be achieved with a conventional ConvNet architecture of substantially increased depth. The models also generalize well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. These results again confirm the importance of depth in visual representations.