What Is the Block Structure That Appears in the Features of the Model?
3 main points
✔️ Investigates how the width and depth of deep neural networks affect their representations
✔️ Discovers a "block structure" in the learned features
✔️ Suggests that the block structure is linked to over-parameterization and other factors
Do Wide and Deep Networks Learn the Same Things? Uncovering How Neural Network Representations Vary with Width and Depth
written by Thao Nguyen, Maithra Raghu, Simon Kornblith
(Submitted on 29 Oct 2020 (v1), last revised 10 Apr 2021 (this version, v2))
Comments: Accepted by ICLR 2021.
Subjects: Machine Learning (cs.LG)
The images used in this article are from the paper or created based on it.
Introduction
For deep neural networks, accuracy improvements often come from the simple approach of scaling width and depth. While this approach is fundamental, there is limited understanding of how varying depth and width changes the properties of a model. Understanding this question is especially important as the computational resources devoted to designing and training new network architectures continue to increase. Namely, how do depth and width affect the representations that are learned?
The contributions of the present paper are as follows.
- A method based on centered kernel alignment (CKA) is developed to efficiently measure the similarity of the hidden representations of wide and deep neural networks using mini-batches.
- Applying the method to various network architectures, we found that the representations of wide and deep models exhibit a characteristic structure (which we call the block structure). We also investigated how the block structure varies across different training runs and found a link between the block structure and the over-parameterization of the model.
- Further analysis showed that the block structure corresponds to hidden representations with a single principal component that explains most of the variance, and that this principal component is preserved and propagated through the corresponding layers. We also showed that hidden layers within a block structure can be pruned with minimal impact on performance.
- For models without block structure, similar representations were found in the corresponding layers, whereas block-structure representations were found to be unique to each model.
- Finally, we examined how different depths and widths affect the output of the models, finding that the wide and deep models individually generate different errors.
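The claim above that block-structure representations have a single dominant principal component can be checked numerically. Below is a minimal NumPy sketch (not the authors' code) that computes the fraction of variance explained by the first principal component of an activation matrix; a value near 1 would indicate the near-rank-1 structure the paper describes.

```python
import numpy as np

def first_pc_variance_fraction(acts):
    """Fraction of total variance explained by the first principal
    component of an activation matrix of shape (examples, features)."""
    acts = acts - acts.mean(axis=0)  # center features across examples
    # Squared singular values are proportional to per-component variance.
    s = np.linalg.svd(acts, compute_uv=False)
    return (s[0] ** 2) / (s ** 2).sum()

# Toy demonstration: a nearly rank-1 activation matrix, standing in for
# a layer inside a block structure.
rng = np.random.default_rng(0)
u = rng.standard_normal((200, 1))
v = rng.standard_normal((1, 32))
low_rank = u @ v + 0.01 * rng.standard_normal((200, 32))
frac = first_pc_variance_fraction(low_rank)
```

For a genuinely rank-1 signal with small noise, as above, the first component captures nearly all the variance; for a layer with a diverse representation the fraction would be much smaller.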
Our experimental setup uses ResNet trained on the standard image classification datasets CIFAR-10, CIFAR-100, and ImageNet, and we adjust the width and depth of the network by increasing the number of channels and layers respectively at each stage. While changing these, we analyze the features.
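The width/depth scaling described above can be sketched as a configuration helper. The function below is a hypothetical illustration, not the paper's code: it assumes the standard CIFAR ResNet layout of three stages with base channel counts (16, 32, 64), and scales channels per stage for width and residual blocks per stage for depth.

```python
# Hypothetical helper illustrating how ResNet width/depth scaling is
# typically parameterized: a width multiplier scales every stage's
# channel count, a depth multiplier scales the blocks per stage.
def resnet_config(width_mult=1, depth_mult=1,
                  base_channels=(16, 32, 64), base_blocks=(5, 5, 5)):
    channels = [c * width_mult for c in base_channels]
    blocks = [b * depth_mult for b in base_blocks]
    return {"channels": channels, "blocks_per_stage": blocks}

wide_cfg = resnet_config(width_mult=4)   # a 4x-wider network
deep_cfg = resnet_config(depth_mult=3)   # a 3x-deeper network
```

The key point for the analysis that follows is that width and depth are varied independently, so their effects on the learned representations can be compared.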
Measuring representation similarity with mini-batch CKA
Analyzing the hidden representations of neural networks is difficult for several reasons:
- they are large in size,
- important features may be distributed across multiple neurons within a layer, and
- there is no alignment between neurons in different layers.
However, Centered kernel alignment (CKA) addresses these challenges and provides a robust way to quantitatively study the representation of neural networks by computing the similarity between pairs of activation matrices. Specifically, we use linear CKA, which was previously validated for this purpose by Kornblith et al. and adapted for efficient estimation using mini-batches.
CKA takes the representations from two layers as input and outputs a similarity score between 0 (not similar) and 1 (similar). The figure below illustrates this clearly.
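To make the metric concrete, here is a minimal NumPy sketch of full-batch linear CKA. Note that the paper's efficient variant averages an unbiased HSIC estimator over mini-batches, which this simplified sketch omits.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x: (n_examples, n_features_x), y: (n_examples, n_features_y).
    Returns a similarity score in [0, 1].
    """
    # Center each feature (column) across examples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(y.T @ x) ** 2
    return num / (np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y))

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 64))
print(linear_cka(a, a))  # identical representations -> 1.0 (up to float error)
```

Because CKA compares Gram matrices rather than individual neurons, it sidesteps the neuron-alignment problem listed above and is invariant to orthogonal transformations and isotropic scaling of the representations.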
We begin by examining how the depth and width of a model's architecture affect its internal representation structure. How similar (or dissimilar) are the representations of the different hidden layers across architectures?
As it turns out, the further to the right of the figure you go (the greater the width and depth), the more clearly the block structure emerges. The block structure is a large set of consecutive layers with very similar feature representations, which appears as a yellow square in the heatmap.
The figure shows the results of training ResNets with different depths (top row) and widths (bottom row) on CIFAR-10, computing the representation similarity for every pair of layers within the same model. Since each heatmap compares a model with itself, the diagonal naturally shows high similarity. As expected, the representations show a grid-like pattern of lower similarity because residual connections skip layers. We can also see that the representations after residual connections (the later layers) are more similar to other post-residual representations than to the representations inside the ResNet blocks. The same trend is observed in models without residual connections (Appendix Figure C.1 in the original paper).
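The heatmaps described above are simply the CKA score evaluated for every pair of layers. A self-contained sketch (with toy random activations standing in for real layer outputs, which in practice would be collected with forward hooks on a trained network):

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two (examples, features) activation matrices."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    num = np.linalg.norm(y.T @ x) ** 2
    return num / (np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y))

def cka_heatmap(layer_activations):
    """All-pairs layer similarity for one model -> symmetric (L, L) matrix."""
    n = len(layer_activations)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):  # exploit symmetry to halve the work
            sim[i, j] = sim[j, i] = linear_cka(layer_activations[i],
                                               layer_activations[j])
    return sim

# Toy stand-in for per-layer activations on a batch of inputs.
rng = np.random.default_rng(0)
acts = [rng.standard_normal((128, 32)) for _ in range(4)]
heat = cka_heatmap(acts)
```

Plotting `heat` as an image reproduces the style of figure discussed here: the diagonal is 1 by construction, and a block structure would appear as a large bright square of near-1 off-diagonal entries.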
Block structure and model over-parameterization
We find that the block structure appears as the depth and width of the model increases. The next question is whether the block structure is related to the absolute size of the model or the size of the model relative to the size of the training data.
In general, modern models have more parameters than there are samples in the training set; the training data is often much smaller than the parameter count. Nevertheless, it has been reported that high performance on holdout data can be achieved even in this over-parameterized regime.
The relationship between varying the width of the network and the data set is shown in the figure below. (Varying the depth of the network is shown in Figure D.2 in the original appendix.)
We can see that as the amount of training data decreases (in the column direction), block structures emerge even in the narrower networks (lower left). These results indicate that block structure in the internal representation arises in models that are heavily over-parameterized relative to the training dataset.
Exploring the block structure
We conducted additional experiments to examine the block structure further. The relationship between block structure and accuracy, comparisons of representations across models, and error analyses of models all use the block structure as a key concept. Since it is impossible to cover everything, I will only introduce the relationship with accuracy, which yielded the most interesting results. If you are interested in the other results, please be sure to check the original paper. Additional experiments using linear probes (Alain & Bengio, 2016) give further insight into the block structure.
Block structure and accuracy
We have seen that block structures preserve representations. Now we investigate how these preserved representations affect the network's overall task performance, and whether the block structure can be removed in a way that has minimal impact on performance. In other words, we look at the relationship between block structure and accuracy.
In the figure below, a linear probe mapping from the layer representation to the output classes was trained for each layer of the network. The graphs show the accuracy of the linear probes for the layers before (orange) and after (blue) the residual connections. For the model without block structure, accuracy improves monotonically through the network, whereas for the model with block structure, probe accuracy barely improves inside the block structure. Comparing probe accuracy in the layers before and after the residual connections shows that the residual connections play an important role in preserving the representation within the block structure.
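A linear probe in this sense is just a logistic-regression classifier trained on frozen layer activations. Below is a minimal NumPy sketch (not the paper's implementation), trained with plain gradient descent on toy separable features standing in for a layer's representation.

```python
import numpy as np

def train_linear_probe(features, labels, n_classes, lr=0.1, epochs=100):
    """Multinomial logistic-regression probe on frozen layer features.

    features: (n_examples, n_features), labels: int array of class ids.
    Returns (weights, bias, training accuracy).
    """
    n, d = features.shape
    w = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        logits = features @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # d(cross-entropy)/d(logits)
        w -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    preds = (features @ w + b).argmax(axis=1)
    return w, b, (preds == labels).mean()

# Toy "layer activations": two well-separated clusters.
rng = np.random.default_rng(0)
feats = np.vstack([rng.standard_normal((100, 8)) + 3,
                   rng.standard_normal((100, 8)) - 3])
labs = np.array([0] * 100 + [1] * 100)
w, b, acc = train_linear_probe(feats, labs, n_classes=2)
```

Running such a probe at every layer and plotting the accuracies is exactly the per-layer curve shown in the figure: flat probe accuracy across a run of layers is the signature of a block structure.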
In this article, we have gained an understanding of how scaling the width and depth of deep neural networks affects the model. A key finding is the emergence of a block structure, and examining it has led to several further results. Although omitted here: models with block structure can be truncated mid-network with little effect on test accuracy; when comparing models trained from different seeds, the rate of accuracy loss is related to the size of the block structure present; and block structure may indicate redundancy in model design. The similarity of representations among its constituent layers might be exploited to compress models, for example, and future analysis of the newly defined block structure may help clarify the relationship between model design and accuracy.
The original paper contains much more analysis, so be sure to read it if you are interested.