Benign Overfitting in Two-layer Convolutional
Neural Networks
Abstract
Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving a good test performance. This phenomenon is referred to as “benign overfitting”. Recently, a line of works has emerged studying “benign overfitting” from the theoretical perspective. However, they are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding about when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve constant-level test loss. These together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.
1 Introduction
Modern deep learning models often consist of a huge number of parameters, more than the number of training data points, and are therefore over-parameterized. These over-parameterized models can be trained to overfit the training data (achieving near-perfect training accuracy), while still making accurate predictions on unseen test data. This phenomenon has been observed in a number of prior works (Zhang et al., 2017; Neyshabur et al., 2019), and is often referred to as benign overfitting (Bartlett et al., 2020). It challenges the classical understanding of the bias-variance trade-off in statistical learning theory, and has drawn great attention from the community (Belkin et al., 2018, 2019a, 2019b; Hastie et al., 2019).
There exist a number of works towards understanding the benign overfitting phenomenon. While they offered important insights into the benign overfitting phenomenon, most of them are limited to the settings of linear models (Belkin et al., 2019b; Bartlett et al., 2020; Hastie et al., 2019; Wu and Xu, 2020; Chatterji and Long, 2020; Zou et al., 2021b; Cao et al., 2021) and kernel/random feature models (Belkin et al., 2018; Liang and Rakhlin, 2020; Montanari and Zhong, 2020), and cannot be applied to neural network models that are of greater interest. The only notable exceptions are (Adlam and Pennington, 2020; Li et al., 2021), which attempted to understand benign overfitting in neural network models. However, they are still limited to the “neural tangent kernel regime” (Jacot et al., 2018) where the neural network learning problem is essentially equivalent to kernel regression. Thus, it remains a largely open problem to show how and when benign overfitting can occur in neural networks.
Clearly, understanding benign overfitting in neural networks is much more challenging than that in linear models, kernel methods or random feature models. The foremost challenge stems from nonconvexity: previous works on linear models and kernel methods/random features are all in the convex setting, while neural network training is a highly nonconvex optimization problem. Therefore, while most of the previous works can study the minimum norm interpolators/maximum margin classifiers according to the implicit bias (Soudry et al., 2018) results for the corresponding models, existing implicit bias results for neural networks (e.g., Lyu and Li (2019)) are not sufficient, and a new analysis of the neural network learning process is required.
In this work, we provide one such algorithmic analysis for learning two-layer convolutional neural networks (CNNs) with fixed second-layer parameters and a polynomial ReLU activation function, whose power is a hyperparameter. We consider a setting where the input data consist of label-dependent signals and label-independent noises, and utilize a signal-noise decomposition of the CNN filters to precisely characterize the signal learning and noise memorization processes during neural network training. Our result not only demonstrates that benign overfitting can occur in learning two-layer neural networks, but also gives precise conditions under which the overfitted CNN trained by gradient descent achieves small population loss. Our paper makes the following major contributions:
• We establish population loss bounds for overfitted CNN models trained by gradient descent, and theoretically demonstrate that benign overfitting can occur in learning over-parameterized neural networks. We show that under certain conditions on the signal-to-noise ratio, CNN models trained by gradient descent will prioritize learning the signal over memorizing the noise, and thus achieve both small training and test losses. To the best of our knowledge, this is the first result on the benign overfitting of neural networks that is beyond the neural tangent kernel regime.
• We also establish a negative result showing that when the conditions on the signal-to-noise ratio do not hold, the overfitted CNN model will achieve at least a constant population loss. This result, together with our upper bound result, reveals an interesting phase transition between benign overfitting and harmful overfitting.
• Our analysis is based on a new proof technique, namely the signal-noise decomposition, which decomposes the convolutional filters into a linear combination of the initial filters, the signal vectors and the noise vectors. We convert the neural network learning into a discrete dynamical system of the coefficients from the decomposition, and perform a two-stage analysis that decouples the complicated relation among the coefficients. This enables us to analyze the non-convex optimization problem, and to bound the population loss of the CNN trained by gradient descent. We believe our proof technique is of independent interest and can potentially be applied to deep neural networks.
We note that a concurrent work (Frei et al., 2022) studies learning log-concave mixture data with label flip noise using fully-connected two-layer neural networks with smoothed leaky ReLU activation. Notably, their risk bound matches the risk bound for linear models given in Cao et al. (2021) when the label flip noise is zero. However, their analysis only focuses on upper bounding the risk, and cannot demonstrate the phase transition between benign and harmful overfitting. Compared with (Frei et al., 2022), we focus on CNNs, and consider a different data model to better capture the nature of image classification problems. Moreover, we present both positive and negative results under different SNR regimes, and demonstrate a sharp phase transition between benign and harmful overfitting.
Notation.
Given two sequences $\{x_n\}$ and $\{y_n\}$, we denote $x_n = O(y_n)$ if there exist some absolute constant $C_1 > 0$ and $N > 0$ such that $|x_n| \le C_1 |y_n|$ for all $n \ge N$. Similarly, we denote $x_n = \Omega(y_n)$ if there exist $C_2 > 0$ and $N > 0$ such that $|x_n| \ge C_2 |y_n|$ for all $n \ge N$. We say $x_n = \Theta(y_n)$ if $x_n = O(y_n)$ and $x_n = \Omega(y_n)$ both hold. We use $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$, and $\widetilde{\Theta}(\cdot)$ to hide logarithmic factors in these notations respectively. Moreover, we denote $x_n = \mathrm{poly}(y_n)$ if $x_n = O(y_n^D)$ for some positive constant $D$, and $x_n = \mathrm{polylog}(y_n)$ if $x_n = \mathrm{poly}(\log(y_n))$. Finally, for two scalars $a$ and $b$, we denote their maximum by $a \vee b$.
2 Related Work
A line of recent works has attempted to understand why overfitted predictors can still achieve a good test performance. Belkin et al. (2019a) first empirically demonstrated that in many machine learning models such as random Fourier features, decision trees and ensemble methods, the population risk curve has a double descent shape with respect to the number of model parameters. Belkin et al. (2019b) further studied two specific data models, namely the Gaussian model and Fourier series model, and theoretically demonstrated the double descent risk curve in linear regression. Bartlett et al. (2020) studied over-parameterized linear regression to fit data produced by a linear model with additive noises, and established matching upper and lower bounds of the risk achieved by the minimum norm interpolator on the training dataset. It is shown that under certain conditions on the spectrum of the data covariance matrix, the population risk of the interpolator can be asymptotically optimal. Hastie et al. (2019); Wu and Xu (2020) studied linear regression in the setting where both the dimension and sample size grow together with a fixed ratio, and showed double descent of the risk with respect to this ratio. Chatterji and Long (2020) studied the population risk bounds of over-parameterized linear logistic regression on sub-Gaussian mixture models with label flipping noises, and showed how gradient descent can train over-parameterized linear models to achieve nearly optimal population risk. Cao et al. (2021) tightened the upper bound given by Chatterji and Long (2020) in the case without the label flipping noises, and established a matching lower bound of the risk achieved by over-parameterized maximum margin interpolators. Shamir (2022) proposed a generic data model for benign overfitting of linear predictors, and studied different problem settings under which benign overfitting can or cannot occur.
Besides the studies on linear models, several recent works also studied the benign overfitting and double descent phenomena in kernel methods or random feature models. Zhang et al. (2017) first pointed out that overfitting kernel predictors can sometimes still achieve good population risk. Liang and Rakhlin (2020) studied how interpolating kernel regression with radial basis function (RBF) kernels (and variants) can generalize and how the spectrum of the data covariance matrix affects the population risk of the interpolating kernel predictor. Li et al. (2021) studied the benign overfitting phenomenon of random feature models defined as two-layer neural networks whose first layer parameters are fixed at random initialization. Mei and Montanari (2019); Liao et al. (2020) demonstrated the double descent phenomenon for the population risk of interpolating random feature predictors with respect to the ratio between the dimensions of the random feature and the data input. Adlam and Pennington (2020) showed that neural tangent kernel (Jacot et al., 2018) based kernel regression has a triple descent risk curve with respect to the total number of trainable parameters. Montanari and Zhong (2020) further pointed out an interesting phase transition of the generalization error achieved by neural networks trained in the neural tangent kernel regime.
3 Problem Setup
In this section, we introduce the data generation model and the convolutional neural network we consider in this paper. We focus on binary classification, and present our data distribution in the following definition.
Definition 3.1.
Let the signal be a fixed vector contained in each data point. Then each data point, consisting of an input with two patches and a label, is generated from the following distribution:
1. The label is generated as a Rademacher random variable.
2. A noise vector is generated from a Gaussian distribution.
3. One of the two patches is given by the signal vector and represents the signal; the other is given by the noise vector and represents the noise.
Our data generation model is inspired by image data, where the inputs consist of different patches, and only some of the patches are related to the class label of the image. In detail, the patch holding the signal vector is the signal patch that is correlated with the label of the data, and the patch holding the noise vector is the noise patch that is independent of the label and therefore irrelevant for prediction. We assume that the noise patch is generated from a Gaussian distribution chosen so that the noise vector is orthogonal to the signal vector, for simplicity. Note that when the dimension is large, the norm of the noise vector concentrates around a fixed scale by standard concentration bounds. Therefore, we can treat the ratio between the signal strength and this noise scale as the signal-to-noise ratio, which we denote by SNR for ease of discussion. Note that the Bayes risk for learning our model is zero. We can also add label flip noise similar to Chatterji and Long (2020); Frei et al. (2022) to make the Bayes risk equal to the label flip rate and therefore nonzero, but this will not change the key message of our paper.
Intuitively, if a classifier learns the signal and utilizes the signal patch of the data to make predictions, it can perfectly fit a given training data set and at the same time perform well on the test data. However, when the dimension is sufficiently large, a classifier that is a function of the noises can also perfectly fit the training data set, while its predictions on new test data will be totally random. Therefore, the data generation model given in Definition 3.1 is a useful model to study the population loss of overfitted classifiers. Similar models have been studied in some recent works by Li et al. (2019); Allen-Zhu and Li (2020a, b); Zou et al. (2021a).
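To make the data model concrete, the following sketch generates samples according to Definition 3.1. It is an illustrative sketch only: the noise standard deviation, the signal strength, and the convention of always placing the signal in the first patch are assumptions of the sketch rather than choices fixed by the definition, and the noise is simply projected to be orthogonal to the signal, matching the orthogonality assumption described above.

```python
import numpy as np

def generate_data(n, d, mu, sigma_p, rng):
    """Sample n points from the data model of Definition 3.1 (illustrative sketch).

    Each input has two patches: a signal patch equal to the label times mu, and a
    noise patch xi, where xi is Gaussian noise projected to be orthogonal to mu.
    The signal is always placed in the first patch here for simplicity.
    """
    y = rng.choice([-1, 1], size=n)                    # Rademacher labels
    xi = rng.normal(0.0, sigma_p, size=(n, d))         # raw Gaussian noise
    mu_unit = mu / np.linalg.norm(mu)
    xi = xi - np.outer(xi @ mu_unit, mu_unit)          # enforce <xi, mu> = 0
    X = np.stack([y[:, None] * mu, xi], axis=1)        # shape (n, 2, d)
    return X, y

rng = np.random.default_rng(0)
d, n = 1000, 50
mu = 5.0 * np.eye(d)[0]     # signal along the first coordinate; strength 5 is arbitrary
X, y = generate_data(n, d, mu, sigma_p=1.0, rng=rng)
print(np.abs(X[:, 1, :] @ mu).max())   # numerically zero: noise is orthogonal to the signal
```

With this generator, the signal-to-noise ratio can be varied by rescaling the signal vector while keeping the noise level fixed.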
Two-layer CNNs. We consider a two-layer convolutional neural network whose filters are applied to the two patches separately, and whose second-layer parameters are fixed. The network can then be written in terms of two components, which are defined as:
where the number of convolutional filters in each of the two components is the network width. Here, the activation function is a polynomial ReLU, each filter (i.e., neuron) has its own weight vector, and the weight vectors of each component are grouped together; the collection of all model weights forms the full parameter set. We note that our CNN model can also be viewed as a CNN with global average pooling (Lin et al., 2013). We train the above CNN model by minimizing the empirical cross-entropy loss function
where the loss is the cross-entropy loss evaluated on the training data set. We further define the true loss (test loss) as the expected loss over the data distribution.
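As an illustration of the model and the training objective, the sketch below implements a forward pass and the empirical cross-entropy loss for a two-layer CNN of this form. It assumes, as a concrete instantiation not pinned down by the text above, that the activation is a polynomial ReLU of degree q and that the two components enter the output with fixed second-layer weights +1 and -1, each averaging over its m filters.

```python
import numpy as np

q = 3  # degree of the polynomial ReLU (an assumed value for illustration)

def act(z):
    """Polynomial ReLU activation."""
    return np.maximum(z, 0.0) ** q

def forward(W_pos, W_neg, X):
    """Output of the two-layer CNN on inputs X of shape (n, 2, d).

    W_pos and W_neg have shape (m, d) and hold the filters of the two components.
    Each filter is applied to both patches; the component outputs are averaged
    over filters and combined with fixed second-layer weights +1 and -1
    (an assumption of this sketch).
    """
    def component(W):
        pre = np.einsum('md,npd->nmp', W, X)       # inner products with both patches
        return act(pre).sum(axis=2).mean(axis=1)   # sum over patches, average over filters
    return component(W_pos) - component(W_neg)

def empirical_loss(W_pos, W_neg, X, y):
    """Empirical cross-entropy (logistic) loss over the training set."""
    margins = y * forward(W_pos, W_neg, X)
    return np.mean(np.logaddexp(0.0, -margins))    # mean of log(1 + exp(-y * f(x)))
```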
We consider gradient descent starting from Gaussian initialization, where each entry of the filters is sampled from a mean-zero Gaussian distribution whose variance determines the initialization scale. The gradient descent update of the filters in the CNN can be written as
(3.1)
for every filter of each component, where we introduce a shorthand notation for the cross-entropy loss derivative evaluated at each training example.
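A minimal gradient descent loop matching the update rule (3.1) is sketched below, with the gradient written out by hand from the chain rule for the logistic loss and the polynomial ReLU. The width, learning rate, initialization scale, and number of steps are placeholder values, not the ones prescribed by Condition 4.2.

```python
import numpy as np

def train(X, y, m=20, q=3, eta=0.01, sigma_0=0.01, steps=2000, rng=None):
    """Gradient descent on the empirical cross-entropy loss (illustrative sketch).

    Returns the filters of the two components as a dict keyed by +1 / -1,
    assuming fixed second-layer weights +1 and -1 as in the earlier sketch.
    """
    rng = rng or np.random.default_rng(0)
    n, _, d = X.shape
    W = {+1: sigma_0 * rng.standard_normal((m, d)),   # Gaussian initialization
         -1: sigma_0 * rng.standard_normal((m, d))}

    for _ in range(steps):
        pre = {j: np.einsum('md,npd->nmp', W[j], X) for j in (+1, -1)}
        f = sum(j * (np.maximum(pre[j], 0.0) ** q).sum(2).mean(1) for j in (+1, -1))
        lprime = -1.0 / (1.0 + np.exp(y * f))         # derivative of log(1+e^{-z}) at z = y f
        for j in (+1, -1):
            sig_prime = q * np.maximum(pre[j], 0.0) ** (q - 1)
            # d f / d w_{j,r} = (j / m) * sum over patches of sigma'(<w_{j,r}, patch>) * patch
            grad = np.einsum('n,nmp,npd->md', lprime * y * j / m, sig_prime, X) / n
            W[j] = W[j] - eta * grad                  # the update rule (3.1)
    return W
```

The per-example loss derivative appearing in this loop is the quantity abbreviated by the shorthand notation mentioned above.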
4 Main Results
In this section, we present our main theoretical results. At the core of our analyses and results is a signal-noise decomposition of the filters in the CNN trained by gradient descent. By the gradient descent update rule (3.1), it is clear that each gradient descent iterate of a filter is a linear combination of its random initialization, the signal vector, and the noise vectors in the training data. Motivated by this observation, we introduce the following definition.
Definition 4.1.
Consider the convolution filters of the CNN at a given iteration of gradient descent. Then there exist unique signal and noise coefficients such that
We further split each noise coefficient into its positive and negative parts. Then we have that
(4.1)
We refer to (4.1) as the signal-noise decomposition of the filters. We add normalization factors in the definition so that each coefficient is approximately the inner product between the corresponding filter and the signal or noise vector. In this decomposition, the signal coefficients characterize the progress of learning the signal vector, and the noise coefficients characterize the degree of noise memorization by the filters. Evidently, based on this decomposition, at some iteration of training: (i) if some of the signal coefficients are large enough while all noise coefficients are relatively small, then the CNN will have small training and test losses; (ii) if some noise coefficients are large and all signal coefficients are small, then the CNN will achieve a small training loss, but a large test loss. Thus, Definition 4.1 provides a handle for us to study the convergence of the training loss as well as the population loss of the CNN trained by gradient descent.
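Because the decomposition (4.1) is a statement about a fixed filter in a fixed basis, its coefficients can be computed explicitly for a trained filter; the sketch below does so by solving the corresponding least-squares problem, which has a unique solution when the signal and noise vectors are linearly independent. The paper's coefficients agree with these up to the normalization factors mentioned above.

```python
import numpy as np

def decomposition_coefficients(w, w0, mu, Xi):
    """Coefficients of (w - w0) in the basis {mu, xi_1, ..., xi_n} (sketch).

    w, w0 : trained and initial filter, each of shape (d,)
    mu    : signal vector, shape (d,)
    Xi    : noise vectors of the training set, shape (n, d)

    Solves the least-squares problem for the unique coefficients (unique because
    the vectors are linearly independent with probability one for Gaussian noise);
    the paper's signal/noise coefficients agree with these up to the normalization
    factors of Definition 4.1.
    """
    basis = np.vstack([mu, Xi]).T                         # shape (d, n + 1)
    coef, *_ = np.linalg.lstsq(basis, w - w0, rcond=None)
    return coef[0], coef[1:]                              # signal coef, noise coefs
```

Tracking these coefficients along the gradient descent trajectory of the earlier training sketch is exactly the bookkeeping that Definition 4.1 formalizes: benign overfitting corresponds to the signal coefficient growing while the noise coefficients stay small, and harmful overfitting to the reverse.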
Our results are based on the following conditions on the dimension, sample size, neural network width, learning rate, and initialization scale.
Condition 4.2.
Suppose that
1. The dimension is sufficiently large.
2. The training sample size and neural network width are at least polylogarithmic in the dimension.
3. The learning rate is sufficiently small.
4. The standard deviation of the Gaussian initialization is appropriately small.
A few remarks on Condition 4.2 are in order. The condition on the dimension is to ensure that the learning is in a sufficiently over-parameterized setting, and similar conditions have been made in the study of learning over-parameterized linear models (Chatterji and Long, 2020; Cao et al., 2021). For example, for specific choices of the noise level, the condition on the dimension reduces to an explicit polynomial lower bound. Furthermore, we require the sample size and neural network width to be at least polylogarithmic in the dimension to ensure that certain statistical properties of the training data and weight initialization hold with high probability, which is a mild condition. Finally, the conditions on the learning rate and the initialization scale are to ensure that gradient descent can effectively minimize the training loss, and they depend on the scale of the training data points. Under the mild scaling we require on the signal strength and noise level, the step size and the initialization can be chosen at an almost constant order. Based on these conditions, we give our main result on signal learning in the following theorem.
Theorem 4.3.
For any target error, under Condition 4.2, if the stated condition on the signal-to-noise ratio holds (the condition hides a logarithmic factor; this applies to Theorem 4.4 as well), then with high probability there exists an iteration such that:
1. The CNN learns the signal.
2. The CNN does not memorize the noises in the training data.
3. The training loss converges to an arbitrarily small value.
4. The trained CNN achieves a small test loss.
Theorem 4.3 characterizes the case of signal learning. It shows that, if the condition on the signal-to-noise ratio holds, then at least one CNN filter can learn the signal by achieving a large signal coefficient, and as a result, the learned neural network can achieve small training and test losses. To demonstrate the sharpness of this condition, we also present the following theorem for the noise memorization by the CNN.
Theorem 4.4.
For any target error, under Condition 4.2, if the opposite condition on the signal-to-noise ratio holds, then with high probability there exists an iteration such that:
1. The CNN memorizes the noises in the training data.
2. The CNN does not sufficiently learn the signal.
3. The training loss converges to an arbitrarily small value.
4. The trained CNN has a constant-order test loss.
Theorem 4.4 holds under the complementary condition on the signal-to-noise ratio. Clearly, this is the opposite regime (up to some logarithmic factors) compared with Theorem 4.3. In this case, the CNN trained by gradient descent mainly memorizes noises in the training data and does not learn enough signal. This, together with the results in Theorem 4.3, reveals a clear phase transition between signal learning and noise memorization in CNN training:
• If the condition on the signal-to-noise ratio in Theorem 4.3 holds, then the CNN learns the signal and achieves a small test loss. This is the regime of benign overfitting.
• If the opposite condition in Theorem 4.4 holds, then the CNN can only memorize noises and will have a constant-order test loss. This is the regime of harmful overfitting.
The phase transition is illustrated in Figure 1. Clearly, when learning a two-layer CNN on the data generated from Definition 3.1, the condition on the signal-to-noise ratio identified above is the precise condition under which benign overfitting occurs. Remarkably, in this case the population loss decreases exponentially with the sample size. Under our condition on the dimension, this term can also be upper bounded by a quantity that is small in the high-dimensional setting. Note that in this over-parameterized setting, applying standard uniform convergence based bounds (Bartlett et al., 2017; Neyshabur et al., 2018) or stability based bounds (Hardt et al., 2016; Mou et al., 2017; Chen et al., 2018) typically gives a vacuous bound on the generalization gap, whereas our bound under the same setting is non-vacuous. This is attributed to our precise analysis of signal learning and noise memorization in Theorems 4.3 and 4.4.
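The phase transition can also be probed empirically with the sketches above: fix the dimension and the noise level, sweep the signal strength (and hence the SNR) and the sample size, train the CNN, and record the test error of the resulting classifier on fresh data. This is only an illustrative experiment reusing the helper functions generate_data, forward, and train from the earlier sketches; the parameter ranges are placeholders and are not the construction behind Figure 1.

```python
import numpy as np

def test_error(W, X_test, y_test):
    """0-1 test error of the trained CNN (uses forward() from the earlier sketch)."""
    preds = np.sign(forward(W[+1], W[-1], X_test))
    return np.mean(preds != y_test)

rng = np.random.default_rng(0)
d = 500
for n in (20, 80):
    for signal_strength in (1.0, 5.0, 20.0):          # larger signal = larger SNR at fixed noise
        mu = signal_strength * np.eye(d)[0]
        X_tr, y_tr = generate_data(n, d, mu, 1.0, rng)
        X_te, y_te = generate_data(2000, d, mu, 1.0, rng)
        W = train(X_tr, y_tr)
        print(f"n={n}, signal={signal_strength}: test error {test_error(W, X_te, y_te):.3f}")
```

Under the theory above, configurations with a large enough signal-to-noise ratio (together with a large enough sample size) should land in the benign regime with small test error, while the low-SNR configurations should overfit harmfully despite fitting the training data.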
Comparison with neural tangent kernel (NTK) results. We want to emphasize that our analysis is beyond the so-called neural tangent kernel regime. In the NTK regime, it has been shown that gradient descent can train an over-parameterized neural network to achieve good training and test accuracies (Jacot et al., 2018; Du et al., 2019b, a; Allen-Zhu et al., 2019b; Zou et al., 2019; Arora et al., 2019a; Cao and Gu, 2019a; Chen et al., 2019). However, it is widely believed in the literature that the NTK analyses cannot fully explain the success of deep learning, as the neural networks in the NTK regime are almost “linearized” (Lee et al., 2019; Cao and Gu, 2019a). Our analysis and results are not in the NTK regime: in the NTK regime, the network parameters stay close to their initialization throughout training, so that the NN model can be approximated by its linearization (Allen-Zhu et al., 2019b; Cao and Gu, 2019a; Chen et al., 2019). In comparison, our analysis does not rely on linearizing the neural network function, and the distance between the learned filters and their initialization can be large.
5 Overview of Proof Technique
In this section, we discuss the main challenges in the study of CNN training under our setting, and explain some key techniques we implement in our proofs to overcome these challenges. The complete proofs of all the results are given in the appendix.
Main challenges. Studying benign overfitting under our setting is a challenging task. The first challenge is the nonconvexity of the training objective function . Nonconvexity has introduced new challenges in the study of benign overfitting particularly because our goal is not only to show the convergence of the training loss, but also to study the population loss in the over-parameterized setting, which requires a precise algorithmic analysis of the learning problem.
5.1 Iterative Analysis of the Signal-Noise Decomposition
In order to study the learning process based on the nonconvex optimization problem, we propose a key technique which enables the iterative analysis of the coefficients in the signal-noise decomposition in Definition 4.1. This technique is given in the following lemma.
Lemma 5.1.
The coefficients in Definition 4.1 satisfy the following equations:
(5.1)
(5.2)
(5.3)
(5.4)
Remark 5.2.
With the decomposition (4.1), the signal learning and noise memorization processes of a CNN can be formally studied by analyzing the dynamics of the coefficients based on the dynamical system (5.2)-(5.4). Note that prior to our work, several existing results have utilized the inner products between the weights and the signal/noise vectors during the neural network training process in order to establish generalization bounds (Brutzkus et al., 2018; Chatterji and Long, 2020; Frei et al., 2021). Similar inner product based arguments are also implemented in Allen-Zhu and Li (2020a, b); Zou et al. (2021a), which study different topics related to learning neural networks. Compared with the inner product based argument, our method has two major advantages: (i) based on the definitions (5.2)-(5.4) and the sign of the loss derivative, it is clear that the signal coefficients and the positive parts of the noise coefficients are monotonically increasing, while the negative parts of the noise coefficients are monotonically decreasing throughout the whole training process. In comparison, monotonicity does not hold in the inner product based argument, especially for the inner products with the noise vectors. (ii) Our signal-noise decomposition also enables a clean homogeneity-based proof for the convergence of the training loss to an arbitrarily small error rate, which will be presented in Subsection 5.2.
With Lemma 5.1, we can reduce the study of the CNN learning process to the analysis of the discrete dynamical system given by (5.1)-(5.4). Our proof then focuses on a careful assessment of the values of the coefficients throughout training. To prepare for more detailed analyses, we first present the following bounds of the coefficients, which hold throughout training.
Proposition 5.3.
Under Condition 4.2, the following bounds hold throughout training:
• The signal coefficients and the positive parts of the noise coefficients are non-negative and upper bounded for all filters and training examples.
• The negative parts of the noise coefficients are non-positive and lower bounded for all filters and training examples.
We can then prove the following lemma, which demonstrates that the training objective function can dominate the gradient norm along the gradient descent path.
Lemma 5.4.
Under Condition 4.2, the following result holds throughout training:
Lemma 5.4 plays a key role in the convergence proof of the training loss. However, note that our study of benign overfitting requires carefully monitoring the changes of the coefficients in the signal-noise decomposition, which cannot be directly done by Lemma 5.4. This is quite a challenging task, due to the complicated interactions among the coefficients. Note that even the signal coefficients, which have the simplest update formula (5.2), depend on all the coefficients across all filters and training examples. This is because the cross-entropy loss derivative term depends on all the neurons of the network. To overcome this challenge, we introduce in the next subsection a decoupling technique based on a two-stage analysis.
5.2 Decoupling with a Two-Stage Analysis.
We utilize a two-stage analysis to decouple the complicated relation among the coefficients. Intuitively, the initial neural network weights are small enough so that the neural network at initialization has constant-level cross-entropy loss derivatives on all the training data. This is guaranteed under Condition 4.2 and matches neural network training in practice. Motivated by this, we can consider the first stage of the training process, during which the loss derivatives remain of constant order, and show significant scale differences among the coefficients. Based on the result in the first stage, we then proceed to the second stage of the training process, where the loss derivatives are no longer at a constant level, and show that the training loss can be optimized to be arbitrarily small while the scale differences established in the first learning stage are preserved throughout the rest of training. In the following, we focus on explaining the key proof steps for Theorem 4.3. The proof idea for Theorem 4.4 is similar, so we defer the details to the appendix.
Stage 1. It can be shown that, until some of the coefficients reach a constant order, the loss derivatives remain of constant order on all the training data. Therefore, we first focus on this first stage of the training process, where the dynamics of the coefficients in (5.2)-(5.4) can be greatly simplified by replacing the loss derivative factors with their constant upper and lower bounds. The following lemma summarizes our main conclusion at stage 1 for signal learning:
Lemma 5.5.
Under the same conditions as Theorem 4.3, there exists an iteration such that
• Some signal coefficient reaches a constant order.
• All noise coefficients remain significantly smaller for all filters and training examples.
Lemma 5.5 takes advantage of the training period when the loss function derivatives remain of constant order to show that the CNN can capture the signal. At the end of stage 1 in signal learning, the signal coefficient reaches a constant order and is significantly larger than the noise coefficients. After this, it is no longer guaranteed that the loss derivatives will remain of constant order, and thus training stage 2 starts.
Stage 2. In this stage, we take into full consideration the exact form of the loss derivatives and show that the training loss will converge to an arbitrarily small value. Thanks to the analysis in stage 1, we know that some signal coefficient is significantly larger than all the noise coefficients at the end of stage 1. This scale difference is the key to our analysis in stage 2. Based on this scale difference and the monotonicity of the coefficients in the signal-noise decomposition, it can be shown that there exists a reference set of weights with the desired margin property throughout stage 2. Moreover, since the neural network is homogeneous in each filter, its output can be related to the inner product between its gradient and its weights. Therefore,
where the second inequality follows by the convexity of the cross-entropy loss function. With the above key technique, we can prove the following lemma.
Lemma 5.6.
Lemma 5.6 states two main results on signal learning. First of all, during this training period, it is guaranteed that the coefficients of noise vectors in the signal-noise decomposition remain sufficiently small. Moreover, it also gives an optimization-type result: the training loss of the best iterate within a given horizon is small as long as the horizon is large enough.
Clearly, the convergence of the training loss stated in Theorem 4.3 directly follows by choosing the training horizon to be sufficiently large in Lemma 5.6. The lemma below further gives an upper bound on the test loss.
Lemma 5.7.
Below we finalize the proof of Theorem 4.3. The proofs of other results are in the appendix.
Proof of Theorem 4.3.
The first part of Theorem 4.3 follows from Lemma 5.5 and the monotonicity of the signal coefficients. The second part of Theorem 4.3 follows from Lemma 5.6. For the third part, consider the quantity defined in Lemma 5.6. Then by its definition, we have
where the first inequality is by triangle inequality, the second inequality is by the signal-noise decomposition of and the definition of , and the last equality is by Proposition 5.3 and Lemma 5.5. Therefore, choosing in Lemma 5.6 ensures that
and there exists such that . This completes the proof of the third part of Theorem 4.3. Finally, combining this bound with Lemma 5.7 gives
which proves the last part of Theorem 4.3. ∎
6 Conclusion and Future Work
This paper utilizes a signal-noise decomposition to study the signal learning and noise memorization processes in the training of a two-layer CNN. We precisely give the conditions under which the CNN will mainly focus on learning signals or memorizing noises, and reveal a phase transition of the population loss with respect to the sample size, signal strength, noise level, and dimension. Our result theoretically demonstrates that benign overfitting can happen in neural network training. An important future work direction is to study the benign overfitting phenomenon of neural networks in learning other data models. Moreover, it is also important to generalize our analysis to deep convolutional neural networks.
Acknowledgements
We would like to thank Spencer Frei for valuable comments and discussions on an earlier version of this paper, and for pointing out a related work.
Appendix A Additional Related Work
There have also been a large number of works studying the optimization and generalization of neural networks. A series of works (Li and Yuan, 2017; Soltanolkotabi, 2017; Du et al., 2018a, b; Zhong et al., 2017; Zhang et al., 2019; Cao and Gu, 2019b) studied the parameter recovery problem in two-layer neural networks, where the data are given by a teacher network and the task is to recover the parameters in the teacher network. These works either focus on the noiseless setting, or require the number of training data points to be larger than the number of parameters in the network, and therefore do not cover the setting where the neural network can overfit the training data. Another line of works (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2018; Golowich et al., 2018; Arora et al., 2018) has studied the generalization gap between the training and test losses of neural networks with uniform convergence based arguments. However, these results are not algorithm-dependent and cannot explain benign overfitting. Some recent works studied the generalization gap based on stability based arguments (Bousquet and Elisseeff, 2002; Hardt et al., 2016; Mou et al., 2017; Chen et al., 2018). A more recent line of works studied the convergence (Jacot et al., 2018; Li and Liang, 2018; Du et al., 2019b; Allen-Zhu et al., 2019b; Du et al., 2019a; Zou et al., 2019) and test error bounds (Allen-Zhu et al., 2019a; Arora et al., 2019a, b; Cao and Gu, 2019a; Ji and Telgarsky, 2020; Chen et al., 2019) of over-parameterized networks in the neural tangent kernel regime. However, these works depend on the equivalence between neural network training and kernel methods, which cannot fully explain the success of deep learning. Compared with the works mentioned above, our work has a different focus, which is to study the conditions for benign and harmful overfitting.
Appendix B Preliminary Lemmas
In this section, we present some pivotal lemmas that give some important properties of the data and the neural network parameters at their random initialization.
Lemma B.1.
Suppose that and . Then with probability at least ,
Proof of Lemma B.1.
By Hoeffding’s inequality, with probability at least , we have
Therefore, as long as , we have
This proves the result for . The proof for is exactly the same, and we can conclude the proof by applying a union bound. ∎
The following lemma estimates the norms of the noise vectors and gives an upper bound of their inner products with each other.
Lemma B.2.
Suppose that and . Then with probability at least ,
for all .
Proof of Lemma B.2.
By Bernstein’s inequality, with probability at least we have
Therefore, as long as , we have
Moreover, clearly has mean zero. For any with , by Bernstein’s inequality, with probability at least we have
Applying a union bound completes the proof. ∎
The following lemma studies the inner products between a randomly initialized CNN convolutional filter and the signal/noise vectors in the training data. The calculations characterize how the neural network at initialization randomly captures signal and noise information.
Lemma B.3.
Suppose that , . Then with probability at least ,
for all , and . Moreover,
for all and .
Proof of Lemma B.3.
It is clear that for each , is a Gaussian random variable with mean zero and variance . Therefore, by Gaussian tail bound and union bound, with probability at least ,
Moreover, is an absolute constant, and therefore by the condition on , we have
By Lemma B.2, with probability at least , for all . Therefore, the result for follows the same proof as . ∎
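For intuition, the concentration properties that Lemmas B.1-B.3 rely on are easy to observe numerically. The snippet below, with arbitrary illustrative values of the dimension, sample size, and noise and initialization scales, checks that the squared noise norms concentrate, and that the pairwise inner products between noise vectors and their inner products with a randomly initialized filter are comparatively small.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma_p, sigma_0 = 2000, 30, 1.0, 0.01
Xi = rng.normal(0.0, sigma_p, size=(n, d))     # noise vectors
w0 = rng.normal(0.0, sigma_0, size=d)          # a randomly initialized filter

sq_norms = (Xi ** 2).sum(axis=1)               # concentrate around sigma_p^2 * d
gram = Xi @ Xi.T
off_diag = np.abs(gram[~np.eye(n, dtype=bool)])

print(sq_norms.min(), sq_norms.max(), sigma_p ** 2 * d)   # norms close to their mean
print(off_diag.max())                                      # cross inner products are much smaller
print(np.abs(Xi @ w0).max())                                # overlap with the random filter is small
```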
Appendix C Signal-noise Decomposition Analysis
In this section, we establish a series of results on the signal-noise decomposition. These results are based on the conclusions in Section B, which hold with high probability. Denote by the event that all the results in Section B hold. Then for simplicity and clarity, we state all the results in this and the following sections conditional on .
Lemma C.1 (Restatement of Lemma 5.1).
The coefficients defined in Definition 4.1 satisfy the following iterative equations:
for all , and .
Proof of Lemma C.1.
By our data model in Definition 3.1 and Gaussian initialization of the CNN weights, it is clear that with probability , the vectors are linearly independent. Therefore, the decomposition (4.1) is unique. Now consider and
It is then easy to check by (3.1) that
Hence by the uniqueness of the decomposition we have and . Then we have that
Moreover, note that by the definition of the cross-entropy loss. Therefore,
(C.1)
(C.2)
Writing out the iterative versions of (C.1) and (C.2) completes the proof. ∎
We can further plug the signal-noise decomposition (4.1) into the iterative formulas in Lemma C.1. By the second equation in Lemma C.1, we have
(C.3)
Moreover, by the third equation in Lemma C.1, we have
(C.4)
if , and for all if . Similarly, by the last equation in Lemma C.1, we have
(C.5)
if , and for all if .
We will now show that the coefficients of the signal-noise decomposition stay at a reasonable scale for a long period of training. Let us consider the learning period up to the maximum admissible number of iterations. Note that we can consider any polynomial training time. Here we list the exact conditions required by the proofs in this section, which are part of Condition 4.2:
(C.6)
(C.7)
(C.8)
Denote . By Lemma B.3, with probability at least , we can upper bound by . Then, by (C.7) and (C.8), it is straightforward to verify the following inequality:
(C.9)
Suppose the conditions listed in (C.6), (C.7) and (C.8) hold. Then we claim that the following property holds throughout the admissible training period.
Proposition C.2 (Restatement of Proposition 5.3).
We will use induction to prove Proposition C.2. We first introduce several technical lemmas that will be used for the proof of Proposition C.2.
Lemma C.3.
For any , it holds that for all , .
Proof of Lemma C.3.
For any time , we have that
where the equation is by our orthogonality assumption. ∎
Proof of Lemma C.4.
Proof of Lemma C.5.
Lemma C.6.
Proof of Lemma C.6.
Now we are ready to prove Proposition C.2.
Proof of Proposition C.2.
Our proof is based on induction. The results hold trivially at initialization, as all the coefficients are zero. Suppose that the results in Proposition C.2 hold for all iterations up to some time. We aim to prove that they also hold at the next iteration.
We first prove that (C.11) holds for , i.e., for , , and . Notice that . Therefore, we only need to consider the case that . When , by Lemma C.4 we have that
and thus
where the last inequality is by induction hypothesis. When , we have that
where we use and in the first inequality, the second inequality is by , and the last inequality is by in (C.6).
Next we prove (C.10) holds for . We have
(C.16)
where the last inequality is due to Lemma C.5. Moreover, recall the update rule of and ,
Let to be the last time that . Then we have that
(C.17)
We first bound as follows,
where the first inequality is by Lemmas C.4 and B.2, the second inequality is by and , the last inequality is by .
Second, we bound . For and , we can lower bound as follows,
where the first inequality is by Lemma C.4, the second inequality is by and due to the definition of and , the last inequality is by and . Similarly, for and , we can also upper bound as follows,
where the first inequality is by Lemma C.4, the second inequality is by induction hypothesis , the last inequality is by and . Thus, plugging the upper and lower bounds of into gives
where the first inequality is by (C.16), the second inequality is by Lemma B.2, the third inequality is by in (C.6), the fourth inequality is by our choice of and the last inequality is due to the fact that . Plugging the bound of into (C.17) completes the proof for . Similarly, we can prove that using in (C.6). Therefore Proposition C.2 holds for , which completes the induction. ∎
Based on Proposition C.2, we introduce some important properties of the training loss function for .
Proof of Lemma C.7.
We first prove that
(C.18)
Without loss of generality, we suppose that and . Then we have that
where the first and second inequalities are by triangle inequality, the third inequality is by Jensen’s inequality and Lemma B.2, and the last inequality is due to Lemma C.5. Denote . Then we have that , and besides, by Lemma C.5. Then we have that
where (i) is by because has an exponentially decaying tail. Now we can upper bound the gradient norm as follows,
where the first inequality is by the triangle inequality, the second inequality is by (C.18), the third inequality is by the Cauchy-Schwarz inequality, and the last inequality is due to the property of the cross-entropy loss . ∎
Appendix D Signal Learning
In this section, we consider the signal learning case under the condition that . We remind the readers that the proofs in this section are based on the results in Section B, which hold with high probability.
D.1 First stage
Lemma D.1 (Restatement of Lemma 5.5).
Under the same conditions as Theorem 4.3, in particular if we choose
(D.1)
where is a positive constant, there exists time
such that
• Some signal coefficient reaches a constant order.
• All noise coefficients remain significantly smaller for all filters and training examples.
Proof of Lemma D.1.
Let
(D.2)
We first prove the second bullet. Define . We use induction to show that
(D.3)
for all . By definition, clearly we have . Now suppose that there exists some such that (D.3) holds for . Then by (C.4) and (C.5) we have
where the second inequality is by , the third inequality is due to Lemmas B.2 and B.3, the fourth inequality follows by the condition that , and the last inequality follows by the induction hypothesis (D.3). Taking a telescoping sum over then gives
where the second inequality follows by in our induction hypothesis. Therefore, by induction, we have for all .
Now, without loss of generality, let us consider first. Denote by the last time for in the period satisfying that . Then for , and . Therefore, by Lemmas C.5 and C.6, we know that for all with . Thus there exists a positive constant such that for all with .
By (C.3), for we have
Denote and let . Then we have
where the second inequality is by the lower bound on the number of positive data in Lemma B.1, the third inequality is due to the fact that is an increasing sequence, and the last inequality follows by proved in Lemma B.3. Therefore, the sequence will exponentially grow and we have that
where the second inequality is due to the fact that for and our condition of and listed in Condition 4.2, and the last inequality follows by Lemma B.3 and . Therefore, will reach within iterations. Since , will reach within iterations. We can next verify that
where the inequality holds due to our SNR condition in (D.1). Therefore, by the definition of , we have , where we use the non-decreasing property of . The proof for is similar, and we can prove that while , which completes the proof. ∎
D.2 Second Stage
By the results we get in the first stage we know that
And at the beginning of the second stage, the following properties hold:
-
•
.
-
•
where .
Lemma 5.1 implies that the learned feature will not get worse, i.e., for , we have that , and therefore . Now we choose as follows:
Based on the above definition of , we have the following lemma.
Lemma D.2.
Under the same conditions as Theorem 4.4, we have that .
Proof of Lemma D.2.
Lemma D.3.
Under the same conditions as Theorem 4.3, we have that for all and .
Proof of Lemma D.3.
Recall that , so we have
(D.4)
where the inequality is by Lemma B.3. Next we will bound the inner-product terms in (D.4) respectively. By Lemma C.6, we have that for
(D.5)
We can also get the upper bound of the inner products between the parameter and the signal (noise) as follows,
(D.6)
where (i) is by Lemma C.3, (iii) is by Lemma C.4, (ii) and (iv) are due to Proposition C.2. Plugging (D.5) and (D.6) into (D.4) gives,
where the last inequality is by in Condition 4.2. This completes the proof. ∎
Lemma D.4.
Proof of Lemma D.4.
We first apply a proof technique similar to Lemma 2.6 in Ji and Telgarsky (2020). The difference between our analysis and Ji and Telgarsky (2020) is that here the neural network is homogeneous of a higher degree rather than 1-homogeneous.
where the first inequality is by Lemma D.3, the second inequality is due to the convexity of the cross entropy function, and the last inequality is due to Lemma C.7. ∎
Lemma D.5 (Restatement of Lemma 5.6).
Under the same conditions as Theorem 4.3, let . Then we have for all . Besides,
for all , and we can find an iterate with training loss smaller than within iterations.
Proof of Lemma D.5.
By Lemma D.4, for any , we have that
holds for . Taking a summation, we obtain that
(D.7)
for all . Dividing on both side of (D.7) gives that
Then we can take and have that
where we use the fact that and our choice that . Because the mean is smaller than , we can conclude that there exists such that .
Finally, we will prove that for all . Plugging into (D.7) gives that
(D.8)
where the inequality is due to in Lemma D.2. Define . We will use induction to prove for all . At , by the definition of , clearly we have . Now suppose that there exists such that for all . Then for , by (C.4) and (C.5) we have
where the second inequality is due to Lemmas B.2 and B.3, and the last inequality follows by the assumption that . Taking a telescoping sum over , we have that
where (i) is by our induction hypothesis that , (ii) is by , (iii) is by , (iv) is due to in (D.8), (v) is by , and (vi) is by the definition that and by Condition 4.2. Therefore, , which completes the induction. ∎
D.3 Population Loss
Consider a new data point drawn from the distribution defined in Definition 3.1. Without loss of generality, we suppose that the first patch is the signal patch and the second patch is the noise patch, i.e., . Moreover, by the signal-noise decomposition, the learned neural network has parameter
for and .
Lemma D.6.
Under the same conditions as Theorem 4.3, we have that for all .
Proof.
Lemma D.7.
Under the same conditions as Theorem 4.3, with probability at least , we have that for all , where .
Proof of Lemma D.7.
By (D.9), , where . Clearly is a Gaussian distribution with mean zero and standard deviation smaller than . Therefore, the probability is bounded by
Applying a union bound over completes the proof. ∎
Lemma D.8 (Restatement of Lemma 5.7).
Proof of Lemma D.8.
Let the event be that Lemma D.7 holds. Then we can divide the test loss into two parts:
(D.10)
In the following, we bound and respectively.
Bounding : Since , there must exist one such that , which implies that . Therefore, we have that
(D.11)
where (i) is by . If event holds, we have that
(D.12)
where the second inequality is by in Lemma D.7 and in Lemma D.6. Thus we have that
where the first inequality is by the property of cross-entropy loss that for all , the second inequality is by (D.12), and the third inequality is by (D.11). Dropping the event in the expectation gives .
Bounding : Next we bound the second term . We choose an arbitrary training data point such that . Then we have
(D.13)
where the first inequality is due to , the second inequality is by the property of cross-entropy loss, i.e., for all , the third inequality is by , the fourth inequality is by in Lemma C.5, and the last inequality is due to in (D.9). Then we further have that
where the first inequality is by the Cauchy-Schwarz inequality, the second inequality is by (D.13), the third inequality is by Lemma D.7 and the fact that , and the last inequality is by our condition in Condition 4.2. Plugging the bounds of , into (D.10) completes the proof. ∎
Appendix E Noise Memorization
In this section, we will consider the noise memorization case under the condition that . We remind the readers that the proofs in this section are based on the results in Section B, which hold with high probability.
We also remind readers that is defined in Appendix C. Denote . The following lemma provides a lower bound of .
Lemma E.1.
Proof of Lemma E.1.
E.1 First Stage
Lemma E.2.
Under the same conditions as Theorem 4.4, in particular if we choose
(E.2)
where is a positive constant, then there exist
such that
-
•
for all .
-
•
for all .
-
•
for all .
Proof of Lemma E.2.
Let
(E.3)
By Proposition C.2, we have that for all , , and . Since and , we have that . Next, we will carefully compute the growth of the .
where the inequality is by . Let , then we have that
(E.4)
We will use induction to prove that for . By definition, clearly we have that . Now suppose that there exists some such that holds for . Taking a telescoping sum of (E.4) gives that
where the second inequality is by our induction hypothesis, the third inequality is by in Lemma B.3, and the last inequality is by (E.3). Thus we have that for all . Therefore, for all . Recall that
For , Lemma C.4 implies that
where the last inequality is by . Now let . For each , denote by the last time in the period satisfying that . Then for , and . Therefore, by Lemmas C.5 and C.6, we know that . Thus there exists a positive constant such that for all . It is also easy to check that . Then we can carefully compute the growth of ,
where the second inequality is by the non-decreasing property of . Therefore, is an exponentially increasing sequence and we have that
where the second inequality is due to the fact that for and our conditions of and listed in Condition 4.2, and the last inequality is due to . Therefore, will reach within iterations. Since , will reach within iterations. We can next verify that
where the inequality holds due to our SNR condition in (E.2). Therefore, by the definition of , we have , where we use the non-decreasing property of . This completes the proof. ∎
E.2 Second Stage
By the signal-noise decomposition, at the end of the first stage, we have
for and . By the results obtained in the first stage, we know that at the beginning of this stage the following properties hold:
-
•
for all .
-
•
.
-
•
, where .
Note that Lemma 5.1 implies that the learned noise will not decrease, i.e., . Therefore, for all data index , we have for all . Now we choose as follows
Based on the definition of , we have the following lemma.
Lemma E.3.
Under the same conditions as Theorem 4.4, we have that .
Proof of Lemma E.3.
Lemma E.4.
Proof of Lemma E.4.
Recall that , so we have
(E.5)
where the first inequality is by Lemma B.3 and the last inequality is by Lemma B.2. Next we will bound the inner-product terms in (E.5) respectively. By Lemma C.6, we have that
(E.6)
where the last inequality is by Proposition C.2.
For , we can bound the inner product between the parameter and the noise as follows
(E.7)
where the first inequality is by Lemma C.4, the second inequality is by Lemma E.2.
For , we can bound the inner product between the parameter and the noise as follows
(E.8)
where the first inequality is by Lemma C.5 and the last inequality is by Lemma B.3 and the conditions of and in Condition 4.2. Therefore, plugging (E.6), (E.7), (E.8) into (E.5) gives
where the last inequality is by and in Condition 4.2. ∎
Lemma E.5.
Proof of Lemma E.5.
Lemma E.6.
Under the same conditions as Theorem 4.4, let . Then we have , for all . Besides,
for all , and we can find an iterate with training loss smaller than within iterations.
Proof of Lemma E.6.
By Lemma E.5, for any , we obtain that
(E.9)
holds for . Taking a summation, we have that
(E.10)
where (i) is by and (ii) is by Lemma E.3. Then we can use induction to prove that for all . Clearly, by the definition of , we have . Now suppose that there exists such that for all . Then by (C.3), we have
for all and , where (i) is by the induction hypothesis , (ii) is by , (iii) is by (E.10), (iv) is by , and (v) is by Condition 4.2. Therefore, we have , which completes the induction. ∎
E.3 Population Loss
Lemma E.7 (4th statement of Theorem 4.4).
Under the same conditions as Theorem 4.4, within iterations, we can find such that . Besides, for any we have that .
Proof of Lemma E.7.
Given a new example , we have that
where (i) is by triangle inequality and (ii) is by in Lemma E.6 and in Proposition 5.3.
Therefore, we have that . So with probability ,
Since the signal vector is orthogonal to noises, by in Lemma E.6, we also have that . Now by union bound, with probability at least , we have that
where the last inequality is by and in Condition 4.2. Therefore, with probability at least , we have that
Thus . This completes the proof. ∎
References
- Adlam and Pennington (2020) Adlam, B. and Pennington, J. (2020). The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning. PMLR.
- Allen-Zhu and Li (2020a) Allen-Zhu, Z. and Li, Y. (2020a). Feature purification: How adversarial training performs robust deep learning. arXiv preprint arXiv:2005.10190 .
- Allen-Zhu and Li (2020b) Allen-Zhu, Z. and Li, Y. (2020b). Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816 .
- Allen-Zhu et al. (2019a) Allen-Zhu, Z., Li, Y. and Liang, Y. (2019a). Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems.
- Allen-Zhu et al. (2019b) Allen-Zhu, Z., Li, Y. and Song, Z. (2019b). A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning.
- Arora et al. (2019a) Arora, S., Du, S., Hu, W., Li, Z. and Wang, R. (2019a). Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning.
- Arora et al. (2019b) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. and Wang, R. (2019b). On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems.
- Arora et al. (2018) Arora, S., Ge, R., Neyshabur, B. and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning.
- Bartlett et al. (2017) Bartlett, P. L., Foster, D. J. and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems.
- Bartlett et al. (2020) Bartlett, P. L., Long, P. M., Lugosi, G. and Tsigler, A. (2020). Benign overfitting in linear regression. Proceedings of the National Academy of Sciences .
- Belkin et al. (2019a) Belkin, M., Hsu, D., Ma, S. and Mandal, S. (2019a). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116 15849–15854.
- Belkin et al. (2019b) Belkin, M., Hsu, D. and Xu, J. (2019b). Two models of double descent for weak features. arXiv preprint arXiv:1903.07571 .
- Belkin et al. (2018) Belkin, M., Ma, S. and Mandal, S. (2018). To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning.
- Bousquet and Elisseeff (2002) Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of machine learning research 2 499–526.
- Brutzkus et al. (2018) Brutzkus, A., Globerson, A., Malach, E. and Shalev-Shwartz, S. (2018). Sgd learns over-parameterized networks that provably generalize on linearly separable data. In International Conference on Learning Representations.
- Cao and Gu (2019a) Cao, Y. and Gu, Q. (2019a). Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems.
- Cao and Gu (2019b) Cao, Y. and Gu, Q. (2019b). Tight sample complexity of learning one-hidden-layer convolutional neural networks. In Advances in Neural Information Processing Systems.
- Cao et al. (2021) Cao, Y., Gu, Q. and Belkin, M. (2021). Risk bounds for over-parameterized maximum margin classification on sub-gaussian mixtures. Advances in Neural Information Processing Systems 34.
- Chatterji and Long (2020) Chatterji, N. S. and Long, P. M. (2020). Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. arXiv preprint arXiv:2004.12019 .
- Chen et al. (2018) Chen, Y., Jin, C. and Yu, B. (2018). Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619 .
- Chen et al. (2019) Chen, Z., Cao, Y., Zou, D. and Gu, Q. (2019). How much over-parameterization is sufficient to learn deep relu networks? arXiv preprint arXiv:1911.12360 .
- Du et al. (2019a) Du, S., Lee, J., Li, H., Wang, L. and Zhai, X. (2019a). Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning.
- Du et al. (2018a) Du, S. S., Lee, J. D. and Tian, Y. (2018a). When is a convolutional filter easy to learn? In International Conference on Learning Representations.
- Du et al. (2018b) Du, S. S., Lee, J. D., Tian, Y., Singh, A. and Poczos, B. (2018b). Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. In International Conference on Machine Learning.
- Du et al. (2019b) Du, S. S., Zhai, X., Poczos, B. and Singh, A. (2019b). Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations.
- Frei et al. (2021) Frei, S., Cao, Y. and Gu, Q. (2021). Provable generalization of sgd-trained neural networks of any width in the presence of adversarial label noise. In International Conference on Machine Learning. PMLR.
- Frei et al. (2022) Frei, S., Chatterji, N. S. and Bartlett, P. L. (2022). Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. arXiv preprint arXiv:2202.05928 .
- Golowich et al. (2018) Golowich, N., Rakhlin, A. and Shamir, O. (2018). Size-independent sample complexity of neural networks. In Conference On Learning Theory.
- Hardt et al. (2016) Hardt, M., Recht, B. and Singer, Y. (2016). Train faster, generalize better: stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on International Conference on Machine Learning-Volume 48. JMLR. org.
- Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S. and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560 .
- Jacot et al. (2018) Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems.
- Ji and Telgarsky (2020) Ji, Z. and Telgarsky, M. (2020). Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. In International Conference on Learning Representations.
- Lee et al. (2019) Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J. and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems.
- Li and Liang (2018) Li, Y. and Liang, Y. (2018). Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems.
- Li et al. (2019) Li, Y., Wei, C. and Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems.
- Li and Yuan (2017) Li, Y. and Yuan, Y. (2017). Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems.
- Li et al. (2021) Li, Z., Zhou, Z.-H. and Gretton, A. (2021). Towards an understanding of benign overfitting in neural networks. arXiv preprint arXiv:2106.03212 .
- Liang and Rakhlin (2020) Liang, T. and Rakhlin, A. (2020). Just interpolate: Kernel “ridgeless” regression can generalize. The Annals of Statistics 48 1329–1347.
- Liao et al. (2020) Liao, Z., Couillet, R. and Mahoney, M. (2020). A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent. In 34th Conference on Neural Information Processing Systems (NeurIPS 2020).
- Lin et al. (2013) Lin, M., Chen, Q. and Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400 .
- Lyu and Li (2019) Lyu, K. and Li, J. (2019). Gradient descent maximizes the margin of homogeneous neural networks. arXiv preprint arXiv:1906.05890 .
- Mei and Montanari (2019) Mei, S. and Montanari, A. (2019). The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355 .
- Montanari and Zhong (2020) Montanari, A. and Zhong, Y. (2020). The interpolation phase transition in neural networks: Memorization and generalization under lazy training. arXiv preprint arXiv:2007.12826 .
- Mou et al. (2017) Mou, W., Wang, L., Zhai, X. and Zheng, K. (2017). Generalization bounds of sgld for non-convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947 .
- Neyshabur et al. (2018) Neyshabur, B., Bhojanapalli, S., McAllester, D. and Srebro, N. (2018). A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representation.
- Neyshabur et al. (2019) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y. and Srebro, N. (2019). Towards understanding the role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations.
- Neyshabur et al. (2015) Neyshabur, B., Tomioka, R. and Srebro, N. (2015). Norm-based capacity control in neural networks. In Conference on Learning Theory.
- Shamir (2022) Shamir, O. (2022). The implicit bias of benign overfitting. arXiv preprint arXiv:2201.11489 .
- Soltanolkotabi (2017) Soltanolkotabi, M. (2017). Learning ReLUs via gradient descent. In Advances in Neural Information Processing Systems.
- Soudry et al. (2018) Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S. and Srebro, N. (2018). The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research 19 2822–2878.
- Wu and Xu (2020) Wu, D. and Xu, J. (2020). On the optimal weighted regularization in overparameterized linear regression. Advances in Neural Information Processing Systems 33.
- Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations.
- Zhang et al. (2019) Zhang, X., Yu, Y., Wang, L. and Gu, Q. (2019). Learning one-hidden-layer ReLU networks via gradient descent. In The 22nd International Conference on Artificial Intelligence and Statistics.
- Zhong et al. (2017) Zhong, K., Song, Z., Jain, P., Bartlett, P. L. and Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning.
- Zou et al. (2021a) Zou, D., Cao, Y., Li, Y. and Gu, Q. (2021a). Understanding the generalization of adam in learning neural networks with proper regularization. arXiv preprint arXiv:2108.11371 .
- Zou et al. (2019) Zou, D., Cao, Y., Zhou, D. and Gu, Q. (2019). Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning .
- Zou et al. (2021b) Zou, D., Wu, J., Braverman, V., Gu, Q. and Kakade, S. (2021b). Benign overfitting of constant-stepsize sgd for linear regression. In Conference on Learning Theory. PMLR.