
With Greater Distance Comes Worse Performance: On the Perspective of Layer Utilization and Model Generalization

James Wang and Cheng-Lin Yang
CyCraft AI Lab, Taipei, Taiwan
{james.wang, cl.yang}@cycraft.com
Abstract

Generalization of deep neural networks remains one of the main open problems in machine learning. Previous theoretical works focused on deriving tight bounds of model complexity, while empirical works revealed that neural networks exhibit double descent with respect to both training sample counts and the neural network size. In this paper, we empirically examined how different layers of neural networks contribute differently to the model; we found that early layers generally learn representations relevant to performance on both training and testing data, whereas deeper layers only minimize training risk and fail to generalize well on testing or mislabeled data. We further show that the distance of the trained weights of the final layers to their initial values correlates strongly with generalization error and can serve as an indicator of overfitting. Moreover, we provide evidence to support post-training regularization by re-initializing the weights of the final layers. Our findings provide an efficient method to estimate the generalization capability of neural networks, and these quantitative results may inspire the derivation of tighter generalization bounds that take the internal structure of neural networks into consideration.

1 Introduction

Modern deep neural networks are extremely over-parameterized and thus capable of memorizing training samples instead of extracting generalizable features. Under the traditional bias-variance trade-off theory, such complex networks should easily overfit training data. However, in recent years, the widespread adoption and success of neural networks in image classification, natural language processing, and several other fields have shown that deep networks are capable of learning generalizable features, which disagrees with VC-dimension-based explanations of model complexity.

Theoretical works also proposed the PAC framework Valiant (1984) to explain the concept of learnability. Subsequent research McAllester (1998) constructed PAC-Bayes theorems, which allowed works such as Dziugaite and Roy (2017); Jiang et al. (2020) to derive tighter generalization bounds for neural networks using different priors. While those bounds are often orders of magnitude better in controlled environments, they remain loose or vacuous in many cases Wilson and Izmailov (2020).

Apart from theoretical derivations, Belkin et al. (2019); Nakkiran et al. (2020) have discussed the “double descent” phenomenon, where the testing risk descends twice with respect to both data complexity and neural network capacity. The classical U-shaped risk curve only appears in the under-parameterized regime, and the testing risk steadily improves after crossing the interpolation threshold. Combining this with the monotonically decreasing training risk, we find that the generalization error peaks at the interpolation threshold and decreases as models move away from it.

Meanwhile, another line of empirical research focuses on the layer-wise view of models. Zhang et al. (2019) showed that different layers of a neural network have different importance to the entire model. Baldock et al. (2021) attempted to explain model behaviour with the classification depth of individual data samples. At the same time, Stephenson et al. (2021) leveraged replica-based mean field theory manifold analysis to evaluate the differences between layers and explain generalization.

Main contributions of this paper are listed as follows:

  • We illustrate that the utilization (distance of trained weights to their initial values) of deeper layers predicts, with high confidence, the amount of non-generalizable information learnt by a model. Thus, it can be used to predict whether a model overfits.

  • We provide evidence to support post-training regularization by re-initializing part of the weights of a trained neural network.

  • We demonstrate that deep neural networks follow a top-down order during the learning process, and refrain from using deeper layers when early layers are already capable of handling the provided training dataset.

  • We formalize the definition of model resilience as an additional method to measure the ability of a model to extract general patterns from training data.

2 Related Works

Generalization of deep neural networks has been discussed in several prior research works. Zhang et al. (2017) showed that neural networks are often capable of memorizing randomized data, and suggested that traditional methods cannot explain a neural network’s ability to generalize. Among research on neural network behaviours, Belkin et al. (2019) were the first to study the phenomenon and coin the name double descent. Following their results, Nakkiran et al. (2020) conducted thorough experiments on double descent for neural networks across several different settings, and proposed the “effective model complexity” theory to explain their results. Aside from double descent, Zhang et al. (2019) found that different layers of neural networks tend to behave differently, but did not give further explanations for their results. Recent works including Baldock et al. (2021); Stephenson et al. (2021) started exploring the relationship between generalization and the behaviour of individual layers, but only focused on the dynamics of the learning procedure. Our work extends this line of research and studies the relationship between layer characteristics and generalization. Different from previous works, which monitor the fluctuation of generalization error throughout the training procedure (epoch-wise double descent), we focus on comparing models with different neural network sizes and data complexity, trained until convergence.

3 Terminology and Hypothesis

3.1 Definitions

Definition 1 (Model).

In this paper, the term model is strictly used to describe the combination of a hypothesis function class $\mathbb{F}$ (equivalent to the neural network architecture in our case), $n$ training samples $(S_X, S_Y)$ where $S_X = \{x_i\}_{i=1}^{n}$ and $S_Y = \{y_i\}_{i=1}^{n}$, an optimizer $O$, and a loss function $j(\hat{y}, y)$ where $\hat{y}$ is the predicted label and $y$ is the target label. Additionally, we define $\mathbb{F}'$ as the final function selected from the hypothesis class $\mathbb{F}$ after training, and $J(f; S_X, S_Y)$ as the average loss over $(S_X, S_Y)$ with respect to the function $f$.

$J(f; S_X, S_Y) = \frac{1}{n}\sum_{i=1}^{n} j(f(x_i), y_i)$  (1)
Definition 2 (Neural Network Capacity).

We define the neural network capacity as the VC-dimension of the network. This definition captures the upper bound of a neural network’s ability to fit training samples while only considering the neural network size.

Definition 3 (Data Complexity).

The complexity of data $C_S$ is defined as the entropy $H(S_X, S_Y)$, which can be decomposed as follows.

$H(S_X, S_Y) = H(S_X) + H(S_Y \mid S_X)$  (2)

$H(S_X)$ describes the “amount” of samples $n$ in $S_X$, and $H(S_Y \mid S_X)$ characterizes the “perplexity” of the samples, i.e., the difficulty of extracting patterns from $S_X$ to decide the corresponding labels $S_Y$. (A more formal definition of $C_S$ is included in the technical appendix.)

Definition 4 (Effective Model Complexity).

The effective complexity of a model with respect to the neural network $\mathbb{F}$, loss function $J$, optimizer $O$, tolerance $\epsilon$, and a fixed number of training epochs $e$ is defined as follows.

$EMC_{\mathbb{F}, J, O, \epsilon, e} \coloneqq \max \{\, C_S \mid \mathbb{E}_{S \sim D^{n}}[\, J(\mathbb{F}'; S_X, S_Y)\,] \le \epsilon \,\}$  (3)

where $\mathbb{F}'$ is the model learnt from $\mathbb{F}$.

Our definition is similar to the one proposed by Nakkiran et al. (2020), but refined in two parts. First, the training procedure $\tau$ in the original paper is split into $J$, $O$, and $e$, and in particular $e$ is decoupled from the rest of the components. The reason for separating $e$ from the rest of the training procedure is that the fluctuation of generalization error during training was shown to be largely affected by the ordering of features in Stephenson and Lee (2021), and Chen et al. (2021) proved it is possible to handcraft samples that exhibit epoch-wise multiple descent for linear models. We therefore conclude that epoch-wise and model-wise double descent must have different root causes, and focus only on model-wise double descent in this work. Thus, it is necessary to choose a sufficiently large $e$, and to measure EMC only after the model has been trained to convergence.

The second difference is that we adopted data complexity $C_S$ instead of the number of training samples $n$. This is necessary for our experiment setup, and will be justified in Section 4.

Definition 5 (Layer Contribution).

The contribution of a layer is defined as the difference between a measurement $M$ on $\mathbb{F}'$ and on $\mathbb{F}'^{\,0}_{L}$, where $\mathbb{F}'^{\,0}_{L}$ is the model with the weights of layer $L$ reset to their initial values. $M$ can be any metric that takes a function $f$ and any number of additional arguments.

$M(\mathbb{F}'^{\,0}_{L};\ \textit{*args}) - M(\mathbb{F}';\ \textit{*args})$  (4)

For instance, taking the average loss $J$ over $(S_X, S_Y)$ as $M$ can be written as follows.

$J(\mathbb{F}'^{\,0}_{L};\ S_X, S_Y) - J(\mathbb{F}';\ S_X, S_Y)$  (5)

Layer contribution measures how the training of a specific layer affects the performance of a model on the chosen metric $M$.
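
For concreteness, the following PyTorch sketch computes Equation 5 for a single layer, assuming the initial weights were saved with `copy.deepcopy(model.state_dict())` before training; the layer-selection-by-name-prefix convention and the cross-entropy loss in the usage comment are our own assumptions rather than details specified in the paper.

```python
import copy

import torch
import torch.nn as nn


@torch.no_grad()
def average_loss(model: nn.Module, loss_fn, X, Y) -> float:
    """J(f; S_X, S_Y): mean loss over the provided samples (Equation 1)."""
    model.eval()
    return loss_fn(model(X), Y).item()


@torch.no_grad()
def layer_contribution(model, init_state, layer_prefix, loss_fn, X, Y) -> float:
    """Equation 5: loss of the model with one layer rolled back to its
    initial weights, minus the loss of the fully trained model."""
    loss_trained = average_loss(model, loss_fn, X, Y)

    reverted = copy.deepcopy(model)          # F'_L^0: only layer L is reset
    state = reverted.state_dict()
    for name, tensor in init_state.items():
        if name.startswith(layer_prefix):
            state[name] = tensor.clone()
    reverted.load_state_dict(state)

    return average_loss(reverted, loss_fn, X, Y) - loss_trained


# Usage (names are illustrative): save init_state = copy.deepcopy(model.state_dict())
# before training, then e.g.
#   layer_contribution(model, init_state, "fc3", nn.CrossEntropyLoss(), X, Y)
```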

Definition 6 (Layer Utilization).

The utilization of a layer is defined as the $\ell_2$-distance between the trained weights $W'_L$ and the initial weights $W^{0}_{L}$.

$\lVert\, W'_L - W^{0}_{L}\, \rVert_2$  (6)

This definition is equivalent to treating the initial weights of a neural network as a prior, and measuring how much a model strays from the prior after training.
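
A minimal sketch of how this quantity can be computed per layer in PyTorch is shown below, again assuming a copy of the initial `state_dict` was saved before training; the function and variable names are illustrative.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def layer_utilization(model: nn.Module, init_state: dict) -> dict:
    """Equation 6: l2 distance between trained and initial weights,
    reported per parameter tensor.

    `init_state` is assumed to be a copy of model.state_dict() taken
    before training (the prior referred to above).
    """
    trained = model.state_dict()
    return {
        name: torch.linalg.vector_norm(
            trained[name].float() - init_state[name].float()
        ).item()
        for name in trained
    }


# Usage: util = layer_utilization(model, init_state); the utilization of the
# final layer is util["<name of last layer>.weight"] (the key depends on the model).
```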

Definition 7 (Generalization).

Generalization is defined as the gap between the inference results on training and testing data.

$\bigl|\, J(\mathbb{F}';\ S_{X_{\text{train}}}, S_{Y_{\text{train}}}) - J(\mathbb{F}';\ S_{X_{\text{test}}}, S_{Y_{\text{test}}})\, \bigr|$  (7)
Definition 8 (Resilience).

Resilience is the ability of a model to cast doubt on mislabeled training samples and recognize the “correct” labels instead.

To measure resilience, we train on data with a certain proportion of intentionally incorrect (corrupted) labels, and test on the correct (recovered) labels after training ends.

$J(\mathbb{F}';\ S_{X_{\text{recovered}}}, S_{Y_{\text{recovered}}})$  (8)

The difference between generalization and resilience is discussed in appendix A.
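
Both quantities are straightforward to measure once a model is trained. The sketch below assumes cross-entropy as the per-sample loss $j$ and that the uncorrupted (“recovered”) labels were kept aside when the noise was injected; function names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def mean_loss(model: nn.Module, X: torch.Tensor, Y: torch.Tensor) -> float:
    model.eval()
    return F.cross_entropy(model(X), Y).item()


@torch.no_grad()
def generalization_gap(model, X_train, Y_train, X_test, Y_test) -> float:
    """Equation 7: |J on training data - J on testing data|."""
    return abs(mean_loss(model, X_train, Y_train) - mean_loss(model, X_test, Y_test))


@torch.no_grad()
def resilience(model, X_train, Y_recovered) -> float:
    """Equation 8: loss on the training inputs paired with the original
    (recovered) labels, measured after training on the corrupted labels.
    Lower values indicate better resilience."""
    return mean_loss(model, X_train, Y_recovered)
```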

3.2 Hypothesis

Hypothesis 1 (Monotonic Contribution).

For fully connected neural networks without skip connections, the contribution of neural network layers follows a monotonic pattern: early layers (closer to the input) start taking effect early on, and deeper layers only start contributing when trained on data with higher complexity.

Hypothesis 2 (Final Layer Utilization as EMC).

Based on Hypothesis 1, we deduce that the utilization of the final layer in a neural network should provide an estimate of EMC.

4 Results

Figure 1: Contribution of each model and layer to training loss. Each block (separated by vertical white lines) shows the contribution of a separate layer for different model sizes. Y-axis ticks show the noise ratio (0.5 means 50% of training labels are shuffled), and x-axis ticks are composed of two fields separated by an underscore: the first is the hidden layer size, and the second is the depth of the layer (50_1 corresponds to the first layer of a model with hidden layer size 50). Ladder-like patterns emerge at the bottom right of each block, where the contribution of deeper layers gradually grows as the noise ratio in the training data increases. A log scale is used so that differences between smaller values remain visible.
Figure 2: Utilization of each model and layer. Heatmaps are organized to display the layers of the same depth from different models in the same block (separated by vertical white lines). Cells marked with a green square have the largest value in their row. The models with the largest utilization (green squares) form a diagonal pattern in each block, and these patterns tend to appear closer to the bottom right of the heatmap in deeper layers.

4.1 Experiment Setup

Our network is composed of 5 fully connected layers with ReLU as the activation function. The network was trained on MNIST and CIFAR10 and optimized with SGD. MNIST and CIFAR10 were chosen mainly because both datasets have a significantly lower proportion of noisy labels than other datasets, based on the findings of Northcutt et al. (2021).
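
As an illustration, a network of this shape could be built in PyTorch as below; the exact hidden sizes, initialization scheme, and other training details are placeholders rather than the paper’s precise configuration.

```python
import torch.nn as nn


def make_mlp(input_dim: int, hidden_dim: int, num_classes: int = 10) -> nn.Sequential:
    """Five fully connected layers with ReLU activations between them.

    input_dim is 784 for flattened MNIST and 3072 for flattened CIFAR10;
    hidden_dim is the quantity swept in the experiments.
    """
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(input_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        nn.Linear(hidden_dim, num_classes),
    )


# Example: the "50_*" columns in Figure 1 would correspond to hidden_dim=50.
model = make_mlp(input_dim=784, hidden_dim=50)
```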

We use fully connected layers instead of convolution layers to avoid introducing additional inductive bias, as we are uncertain how such assumptions about data geometry might further complicate the analysis of model behaviours.

The neural network capacity is tuned by changing the size of the hidden layers, and the data complexity is varied by shuffling a portion of the training labels; the fraction of labels shuffled is called the “noise ratio”. The range of hidden layer sizes and noise ratios is carefully chosen to stay within our computation budget while still demonstrating the transition from the under-parameterized to the over-parameterized regime. These choices are optimized for MNIST; thus, applying the same range of hidden layer sizes and noise ratios to CIFAR10, a more complex dataset, is expected to leave more models in the under-parameterized regime. This tendency can be observed in the figures in later sections.
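
The label-corruption step might look like the following sketch, which permutes a chosen fraction of labels among themselves while keeping the original labels for the resilience measurement; the paper does not spell out its exact shuffling procedure, so treat this as one plausible implementation.

```python
import torch


def shuffle_labels(labels: torch.Tensor, noise_ratio: float, seed: int = 0):
    """Return (corrupted, recovered) label tensors.

    A `noise_ratio` fraction of the labels is permuted among themselves;
    the untouched originals ("recovered" labels) are kept so that
    resilience (Definition 8) can be evaluated after training.
    """
    g = torch.Generator().manual_seed(seed)
    n = labels.numel()
    num_noisy = int(noise_ratio * n)
    idx = torch.randperm(n, generator=g)[:num_noisy]      # samples to corrupt
    perm = idx[torch.randperm(num_noisy, generator=g)]    # shuffle within that subset
    corrupted = labels.clone()
    corrupted[idx] = labels[perm]
    return corrupted, labels.clone()


# Example: corrupt half of the training labels (the 0.5 row in Figure 1).
# corrupted_labels, recovered_labels = shuffle_labels(train_labels, noise_ratio=0.5)
```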

Finally, each experiment was executed 3 times on MNIST and 6 times on CIFAR10, and all models were trained for 10000 epochs before performing inference.

4.2 Layer Inequality

Deeper layers only start contributing when the data complexity increases, which results in the pattern shown in Figure 1. This result indicates that neural networks with relatively high capacity refrain from storing learnt information in deeper layers, and supports the aforementioned Hypothesis 1.

Meanwhile, Figure 2 reveals a clear pattern in utilization. Starting from the top left of each heatmap, where the under-parameterized regime is located, the utilization starts to increase until hitting the boundary marked by green squares (max utilization). Then, the utilization decreases as it moves toward the bottom right, where the over-parameterized regime is located. This pattern is further discussed in section 4.4.

4.3 Generalization and Resilience

Figure 3: Generalization and resilience of trained models. The diagonal pattern from the bottom left to the top right indicates the existence of the interpolation threshold. Models to the left of the threshold fall in the under-parameterized regime, while those on the right belong to the over-parameterized regime.

In Figure 3, generalization and resilience are shown side by side to demonstrate that both measurements follow a similar pattern, where the occurrence of the maximum value for each noise ratio forms a diagonal line. This line marks the double descent interpolation threshold.

The first observation from Figure 3 is that the interpolation threshold forms a positively correlated trace between noise ratios and neural network sizes. This trace agrees with the results of Nakkiran et al. (2020), in which data complexity is tuned by controlling the number of samples provided to train the model. Our experiments show that varying the noise ratio affects trained models in a similar way, which justifies the use of data complexity $C_S$ instead of the number of training samples $n$ in the definition of EMC.

Another important finding comes from comparing Figures 2 and 3: the heatmaps of utilization and of generalization/resilience follow a similar pattern, which we discuss shortly in Section 4.4.

In addition to the aforementioned observations, one more noticeable effect is worth addressing. While both generalization and resilience exhibit the same diagonal pattern, the bottom-right half of the resilience plot, which corresponds to large hidden layers and low noise ratios, is significantly brighter. This effect can be explained by examining the training procedure. In practice, resilience is almost always in conflict with the objective function, since the training risk must be minimized empirically. When the EMC is high enough, the model manages to fit the data with shuffled labels to achieve low training risk. This forced memorization is then reflected as relatively worse resilience compared to under-parameterized models. At the same time, for samples whose training labels were shuffled, over-parameterized models are generally capable of “recognizing” the correct labels (the labels before shuffling) while still catering to the objective function, allowing them to retain better resilience than models near the interpolation threshold.

4.4 Utilization to Generalization and Resilience

Figure 4: Utilization vs. Generalization/Resilience. The x-axis represents the measurement of generalization/resilience over MNIST/CIFAR10, and the y-axis shows the utilization of each layer. The blue points correspond to models lying on the top-left (under-parameterized) side of the interpolation threshold in Figure 3; red points are those on the bottom-right (over-parameterized) side.

Figure 4 compares utilization to generalization and resilience. We first found that models lying on different sides of the interpolation threshold converge into two different sets of patterns in early layers, which illustrates that the under- and over-parameterized regimes exhibit different characteristics. As we investigated deeper layers, the tails of the two curves gradually merged together at the interpolation threshold between them. The utilization of those merged curves is positively correlated with both generalization and resilience. This finding agrees with Hypothesis 2, where the utilization of deeper layers estimates EMC; therefore, it can predict the effective complexity of a model relatively well.
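
One simple way to quantify the relationship suggested by Figure 4 is a Pearson correlation between final-layer utilization and the generalization gap across trained models; the sketch below is an illustrative post-hoc analysis, not the paper’s exact methodology.

```python
import numpy as np


def utilization_generalization_correlation(final_layer_utils, generalization_gaps) -> float:
    """Pearson correlation between final-layer utilization and the
    generalization gap, computed across a collection of trained models."""
    u = np.asarray(final_layer_utils, dtype=float)
    g = np.asarray(generalization_gaps, dtype=float)
    return float(np.corrcoef(u, g)[0, 1])


# Example with hypothetical per-model measurements:
# r = utilization_generalization_correlation([2.1, 3.4, 5.0], [0.10, 0.21, 0.35])
```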

4.5 Contribution and Overfit

Figure 5: Contribution to testing loss and resilience. The overall contribution of early layers tends to be positive, while that of the deeper layers is negative. This shows that later layers participate heavily in storing non-generalizable information, and the harm to testing results outweighs the impact of the helpful features learnt.

As shown in Figure 5, while earlier layers contribute positively to the overall result, deeper layers usually have an overall negative contribution to testing loss and resilience. This result provides insight into the tendencies of trained models, where early layers tend to focus on widely applicable patterns, and deeper layers often capture the remaining noise in the training data.

The observed pattern can be used as a useful indicator that allows users to gauge and revert overfitting of a model by simply re-initializing the last few layers. Re-initialization of deeper layers can be thought of as forcing a trained model to forget memorized noise, providing regularization after the training phase.
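
A sketch of this post-training rollback is given below, reusing the saved initial `state_dict`; which layers count as the “last few” is a design choice, and the prefixes in the usage comment are assumptions.

```python
import copy

import torch
import torch.nn as nn


@torch.no_grad()
def reinitialize_final_layers(model: nn.Module, init_state: dict, layer_prefixes) -> nn.Module:
    """Post-training regularization as described above: roll the chosen
    (deeper) layers back to their saved initial weights while keeping the
    earlier, trained layers intact. Returns a new model; the original is
    left untouched."""
    regularized = copy.deepcopy(model)
    state = regularized.state_dict()
    for name, tensor in init_state.items():
        if any(name.startswith(prefix) for prefix in layer_prefixes):
            state[name] = tensor.clone()
    regularized.load_state_dict(state)
    return regularized


# Usage (prefixes are assumptions; adjust to the parameter names of your model):
# rolled_back = reinitialize_final_layers(model, init_state, ["fc4", "fc5"])
```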

5 Discussion

In Section 4, we have shown that layer utilization depends on the complexity of training samples, and that deeper layers only become active when early layers are no longer capable of handling the provided training data. Apart from the learning order, it is also shown that the utilization of the final layers can predict generalization.

Combining the aforementioned observations, we conclude that the utilization of deeper layers is a suitable estimate of EMC. It can be computed easily, which makes it extremely useful in practice. Finally, we also provide empirical evidence that rolling back the final layers does in fact allow models to revert overfitting behaviour, in agreement with the results of Baldock et al. (2021); Stephenson et al. (2021). While the experiment results are promising, several questions are yet to be addressed.

  • Why does the diagonal utilization pattern appear in each layer separately, and why are the patterns not synchronized across layers?

    While all layers are trained together, similar patterns appearing in all layers seem to suggest that each layer learns semi-independent representations from the output of previous layers, and there might not exist complex cross layer relations during optimization.

  • Why does re-initializing layer weights still yield good results?

    From the perspective of PAC-Bayes theories, our results suggest that the initialization of the deeper layers in a model can be treated as a prior to the distribution of hypothesis class.

    While our results can be supported by the PAC-Bayes framework, it still does not explain why the initialization of deeper layers needs to be respected, whereas early layers are allowed to explore the parameter space further from their initial weights.

    Furthermore, it remains an open problem why the trained weights of early layers cooperate well with the initial weights of deeper layers and retain high performance after re-initialization.

  • Are the results universally applicable for other network structures?

    All experiments in this paper were conducted on fully connected layers. It is worthwhile to explore whether other network structures, such as convolutional or recurrent layers, behave the same way as reported here.

To answer the above questions, we encourage readers to explore the two domains below:

  • The loss landscape of deep neural networks

    A comprehensive understanding would help explain why the parameters do not stray far from their initialization when a zero-training-risk solution clearly exists but the EMC is below the data complexity.

  • How do redundant parameters in deep neural network help in learning?

    We hypothesize that the redundant parameters and their initialization values help construct a prior that makes learning “easy” in over-parameterized settings, and that deviation from initialization in deeper layers has a more critical impact on generalization. This hypothesis is similar to the lottery ticket hypothesis proposed in Frankle and Carbin (2019). However, more detailed research is needed to support this proposition.

We believe our work has established an empirical basis for developing future work in this field, and we hope it inspires more research into the generalization of deep neural networks.

Appendix A Generalization and Resilience

While using the generalization and resilience metrics together provides a reliable view of how well a model learns generalizable features, we discuss here the shortcomings and uncertainties of using generalization and resilience as empirical measurements.

A.1 Shortcomings in Empirical Measurements of Generalization and Resilience

Measuring the generalization of models requires comparing training risks and testing risks. However, Power et al. (2021) have shown that the convergence of loss on training and testing data can be highly asynchronous. To the best of our knowledge, it is currently impossible to identify whether a model has fully converged with respect to the testing loss.

This causes a problem when the performance of a trained model on testing data needs to be measured: it is impossible to determine whether the results have converged to a final stable state or are still at an intermediate stage of the training dynamics. Measurements of generalization suffer from this uncertainty.

Resilience, on the other hand, is more robust in general, since it does not require an independent set of testing samples but is calculated on the training samples. Thus, resilience stabilizes along with the training loss, and can be measured once training converges.

A.2 Resilience and Double Descent

While double descent exists in the measurement of resilience, the second descent is significantly more obscure compared to the measurement of generalization. To understand the cause of this effect, one has to consider how generalization differs between the under- and over-parameterized regimes.

In the under-parameterized regime, the solutions found by the model cannot achieve close-to-perfect results on the training data. This implies that the model will consistently focus on general patterns rather than outliers (shuffled labels). The learnt general patterns help achieve lower risk on recovered labels, resulting in better resilience.

In contrast, close-to-perfect solutions can be found by the model in the over-parameterized regime. This implies that the risk on data with shuffled labels is also minimized, and the risk on recovered labels will be high.

However, compared to critically-parameterized models, which have few good solutions to pick from and must sacrifice learning general patterns for lower training risk, over-parameterized models have more “luxury” in exploring solutions that preserve general patterns while fitting noise. This allows over-parameterized models to still perform slightly better than models near the interpolation threshold.

References

  • Baldock et al. [2021] Robert John Nicholas Baldock, Hartmut Maennel, and Behnam Neyshabur. Deep learning through the lens of example difficulty. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • Belkin et al. [2019] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • Chen et al. [2021] Lin Chen, Yifei Min, Misha Belkin, and Amin Karbasi. Multiple descent: Design your own generalization curve. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • Dziugaite and Roy [2017] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2017.
  • Frankle and Carbin [2019] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
  • Jiang et al. [2020] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations, 2020.
  • McAllester [1998] David A. McAllester. Some pac-bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT’ 98, page 230–234, New York, NY, USA, 1998. Association for Computing Machinery.
  • Nakkiran et al. [2020] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations, 2020.
  • Northcutt et al. [2021] Curtis G Northcutt, Anish Athalye, and Jonas Mueller. Pervasive label errors in test sets destabilize machine learning benchmarks. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
  • Power et al. [2021] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. In ICLR MATH-AI Workshop, 2021.
  • Stephenson and Lee [2021] Cory Stephenson and Tyler Lee. When and how epochwise double descent happens, 2021.
  • Stephenson et al. [2021] Cory Stephenson, Suchismita Padhy, Abhinav Ganesh, Yue Hui, Hanlin Tang, and SueYeon Chung. On the geometry of generalization and memorization in deep neural networks. In International Conference on Learning Representations, 2021.
  • Valiant [1984] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, nov 1984.
  • Wilson and Izmailov [2020] Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in Neural Information Processing Systems, 2020-December, 2020.
  • Zhang et al. [2017] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings, 2017.
  • Zhang et al. [2019] Chiyuan Zhang, Samy Bengio, and Yoram Singer. Are all layers created equal?, 2019.