
Benign Overfitting in Two-layer Convolutional
Neural Networks

Yuan Cao*, Zixiang Chen*, Mikhail Belkin, and Quanquan Gu

*Equal contribution

Department of Statistics and Actuarial Science and Department of Mathematics, The University of Hong Kong, Hong Kong; e-mail: [email protected]
Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
Halicioğlu Data Science Institute, University of California, San Diego, La Jolla, CA 92093, USA; e-mail: [email protected]
Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
Abstract

Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving good test performance. This phenomenon is referred to as “benign overfitting”. Recently, a line of works has emerged that studies benign overfitting from the theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding of when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve constant level test loss. These results together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.

1 Introduction

Modern deep learning models often consist of a huge number of model parameters, more than the number of training data points, and are therefore over-parameterized. These over-parameterized models can be trained to overfit the training data (achieving close to $100\%$ training accuracy), while still making accurate predictions on unseen test data. This phenomenon has been observed in a number of prior works (Zhang et al., 2017; Neyshabur et al., 2019), and is often referred to as benign overfitting (Bartlett et al., 2020). It revolutionizes the classical understanding of the bias-variance trade-off in statistical learning theory, and has drawn great attention from the community (Belkin et al., 2018, 2019a, 2019b; Hastie et al., 2019).

There exist a number of works towards understanding the benign overfitting phenomenon. While they offer important insights into benign overfitting, most of them are limited to the settings of linear models (Belkin et al., 2019b; Bartlett et al., 2020; Hastie et al., 2019; Wu and Xu, 2020; Chatterji and Long, 2020; Zou et al., 2021b; Cao et al., 2021) and kernel/random feature models (Belkin et al., 2018; Liang and Rakhlin, 2020; Montanari and Zhong, 2020), and cannot be applied to neural network models, which are of greater interest. The only notable exceptions are Adlam and Pennington (2020); Li et al. (2021), which attempted to understand benign overfitting in neural network models. However, they are still limited to the “neural tangent kernel regime” (Jacot et al., 2018), where the neural network learning problem is essentially equivalent to kernel regression. Thus, it remains a largely open problem to show how and when benign overfitting can occur in neural networks.

Clearly, understanding benign overfitting in neural networks is much more challenging than in linear models, kernel methods, or random feature models. The foremost challenge stems from nonconvexity: previous works on linear models and kernel methods/random features are all in the convex setting, while neural network training is a highly nonconvex optimization problem. Therefore, while most of the previous works can study the minimum norm interpolators/maximum margin classifiers according to the implicit bias results (Soudry et al., 2018) for the corresponding models, existing implicit bias results for neural networks (e.g., Lyu and Li (2019)) are not sufficient, and a new analysis of the neural network learning process is needed.

In this work, we provide one such algorithmic analysis for learning two-layer convolutional neural networks (CNNs) with the second layer parameters being fixed as $+1$’s and $-1$’s and polynomial ReLU activation function $\sigma(z)=\max\{0,z\}^{q}$, where $q>2$ is a hyperparameter. We consider a setting where the input data consist of label dependent signals and label independent noises, and utilize a signal-noise decomposition of the CNN filters to precisely characterize the signal learning and noise memorization processes during neural network training. Our result not only demonstrates that benign overfitting can occur in learning two-layer neural networks, but also gives precise conditions under which the overfitted CNN trained by gradient descent can achieve small population loss. Our paper makes the following major contributions:

  • We establish population loss bounds of overfitted CNN models trained by gradient descent, and theoretically demonstrate that benign overfitting can occur in learning over-parameterized neural networks. We show that under certain conditions on the signal-to-noise ratio, CNN models trained by gradient descent will prioritize learning the signal over memorizing the noise, and thus achieving both small training and test losses. To the best of our knowledge, this is the first result on the benign overfitting of neural networks that is beyond the neural tangent kernel regime.

  • We also establish a negative result showing that when the conditions on the signal-to-noise ratio do not hold, then the overfitted CNN model will achieve at least a constant population loss. This result, together with our upper bound result, reveals an interesting phase transition between benign overfitting and harmful overfitting.

  • Our analysis is based on a new proof technique namely signal-noise decomposition, which decomposes the convolutional filters into a linear combination of initial filters, the signal vectors and the noise vectors. We convert the neural network learning into a discrete dynamical system of the coefficients from the decomposition, and perform a two-stage analysis that decouples the complicated relation among the coefficients. This enables us to analyze the non-convex optimization problem, and bound the population loss of the CNN trained by gradient descent. We believe our proof technique is of independent interest and can potentially be applied to deep neural networks.

We note that a concurrent work (Frei et al., 2022) studies learning log-concave mixture data with label flip noise using fully-connected two-layer neural networks with smoothed leaky ReLU activation. Notably, their risk bound matches the risk bound for linear models given in Cao et al. (2021) when the label flip noise is zero. However, their analysis only focuses on upper bounding the risk, and cannot demonstrate the phase transition between benign and harmful overfitting. Compared with Frei et al. (2022), we focus on CNNs, and consider a different data model to better capture the nature of image classification problems. Moreover, we present both positive and negative results under different SNR regimes, and demonstrate a sharp phase transition between benign and harmful overfitting.

Notation.

Given two sequences $\{x_{n}\}$ and $\{y_{n}\}$, we denote $x_{n}=O(y_{n})$ if there exist some absolute constant $C_{1}>0$ and $N>0$ such that $|x_{n}|\leq C_{1}|y_{n}|$ for all $n\geq N$. Similarly, we denote $x_{n}=\Omega(y_{n})$ if there exist $C_{2}>0$ and $N>0$ such that $|x_{n}|\geq C_{2}|y_{n}|$ for all $n>N$. We say $x_{n}=\Theta(y_{n})$ if $x_{n}=O(y_{n})$ and $x_{n}=\Omega(y_{n})$ both hold. We use $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$, and $\widetilde{\Theta}(\cdot)$ to hide logarithmic factors in these notations respectively. Moreover, we denote $x_{n}=\mathrm{poly}(y_{n})$ if $x_{n}=O(y_{n}^{D})$ for some positive constant $D$, and $x_{n}=\operatorname{polylog}(y_{n})$ if $x_{n}=\mathrm{poly}(\log(y_{n}))$. Finally, for two scalars $a$ and $b$, we denote $a\vee b=\max\{a,b\}$.

2 Related Work

A line of recent works has attempted to understand why overfitted predictors can still achieve good test performance. Belkin et al. (2019a) first empirically demonstrated that in many machine learning models, such as random Fourier features, decision trees and ensemble methods, the population risk curve has a double descent shape with respect to the number of model parameters. Belkin et al. (2019b) further studied two specific data models, namely the Gaussian model and Fourier series model, and theoretically demonstrated the double descent risk curve in linear regression. Bartlett et al. (2020) studied over-parameterized linear regression to fit data produced by a linear model with additive noises, and established matching upper and lower bounds of the risk achieved by the minimum norm interpolator on the training dataset. It is shown that under certain conditions on the spectrum of the data covariance matrix, the population risk of the interpolator can be asymptotically optimal. Hastie et al. (2019); Wu and Xu (2020) studied linear regression in the setting where both the dimension and sample size grow together with a fixed ratio, and showed double descent of the risk with respect to this ratio. Chatterji and Long (2020) studied the population risk bounds of over-parameterized linear logistic regression on sub-Gaussian mixture models with label flipping noises, and showed how gradient descent can train over-parameterized linear models to achieve nearly optimal population risk. Cao et al. (2021) tightened the upper bound given by Chatterji and Long (2020) in the case without label flipping noises, and established a matching lower bound of the risk achieved by over-parameterized maximum margin interpolators. Shamir (2022) proposed a generic data model for benign overfitting of linear predictors, and studied different problem settings under which benign overfitting can or cannot occur.

Besides the studies on linear models, several recent works also studied the benign overfitting and double descent phenomena in kernel methods or random feature models. Zhang et al. (2017) first pointed out that overfitting kernel predictors can sometimes still achieve good population risk. Liang and Rakhlin (2020) studied how interpolating kernel regression with radial basis function (RBF) kernels (and variants) can generalize and how the spectrum of the data covariance matrix affects the population risk of the interpolating kernel predictor. Li et al. (2021) studied the benign overfitting phenomenon of random feature models defined as two-layer neural networks whose first layer parameters are fixed at random initialization. Mei and Montanari (2019); Liao et al. (2020) demonstrated the double descent phenomenon for the population risk of interpolating random feature predictors with respect to the ratio between the dimensions of the random features and the data input. Adlam and Pennington (2020) showed that neural tangent kernel (Jacot et al., 2018) based kernel regression has a triple descent risk curve with respect to the total number of trainable parameters. Montanari and Zhong (2020) further pointed out an interesting phase transition of the generalization error achieved by neural networks trained in the neural tangent kernel regime.

3 Problem Setup

In this section, we introduce the data generation model and the convolutional neural network we consider in this paper. We focus on binary classification, and present our data distribution $\mathcal{D}$ in the following definition.

Definition 3.1.

Let $\bm{\mu}\in\mathbb{R}^{d}$ be a fixed vector representing the signal contained in each data point. Then each data point $(\mathbf{x},y)$ with $\mathbf{x}=[\mathbf{x}_{1}^{\top},\mathbf{x}_{2}^{\top}]^{\top}\in\mathbb{R}^{2d}$ and $y\in\{-1,1\}$ is generated from the following distribution $\mathcal{D}$:

  1. The label $y$ is generated as a Rademacher random variable.

  2. A noise vector $\bm{\xi}$ is generated from the Gaussian distribution $N(\mathbf{0},\sigma_{p}^{2}\cdot(\mathbf{I}-\bm{\mu}\bm{\mu}^{\top}\cdot\|\bm{\mu}\|_{2}^{-2}))$.

  3. One of $\mathbf{x}_{1},\mathbf{x}_{2}$ is given as $y\cdot\bm{\mu}$, which represents the signal, while the other is given by $\bm{\xi}$, which represents the noise.

Our data generation model is inspired by image data, where the inputs consist of different patches, and only some of the patches are related to the class label of the image. In detail, the patch assigned as $y\cdot\bm{\mu}$ is the signal patch that is correlated to the label of the data, and the patch assigned as $\bm{\xi}$ is the noise patch that is independent of the label of the data and therefore is irrelevant for prediction. We assume that the noise patch is generated from the Gaussian distribution $N(\mathbf{0},\sigma_{p}^{2}\cdot(\mathbf{I}-\bm{\mu}\bm{\mu}^{\top}\cdot\|\bm{\mu}\|_{2}^{-2}))$ to ensure, for simplicity, that the noise vector is orthogonal to the signal vector $\bm{\mu}$. Note that when the dimension $d$ is large, $\|\bm{\xi}\|_{2}\approx\sigma_{p}\sqrt{d}$ by standard concentration bounds. Therefore, we can treat $\|\bm{\mu}\|_{2}/(\sigma_{p}\sqrt{d})\approx\|\bm{\mu}\|_{2}/\|\bm{\xi}\|_{2}$ as the signal-to-noise ratio. For ease of discussion, we denote $\mathrm{SNR}=\|\bm{\mu}\|_{2}/(\sigma_{p}\sqrt{d})$. Note that the Bayes risk for learning our model is zero. We could also add label flip noise similar to Chatterji and Long (2020); Frei et al. (2022) to make the Bayes risk equal to the label flip noise level and therefore nonzero, but this would not change the key message of our paper.
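To make the data model concrete, the following minimal NumPy sketch samples from the distribution in Definition 3.1. The function and variable names (sample_data, mu, sigma_p) are our own illustrative choices rather than notation from the paper, and the assignment of the signal to one of the two patches is made uniformly at random here.

    import numpy as np

    def sample_data(n, mu, sigma_p, rng):
        """Sample n points (x, y) from the distribution D of Definition 3.1.

        Each x consists of two d-dimensional patches: one equals y * mu (the signal),
        the other is Gaussian noise with covariance sigma_p^2 * (I - mu mu^T / ||mu||_2^2),
        i.e. noise orthogonal to mu.
        """
        d = mu.shape[0]
        y = rng.choice([-1, 1], size=n)                   # Rademacher labels
        xi = sigma_p * rng.standard_normal((n, d))        # isotropic Gaussian noise
        mu_unit = mu / np.linalg.norm(mu)
        xi -= np.outer(xi @ mu_unit, mu_unit)             # project out the mu direction
        signal = y[:, None] * mu[None, :]
        swap = rng.random(n) < 0.5                        # which patch carries the signal
        x1 = np.where(swap[:, None], xi, signal)
        x2 = np.where(swap[:, None], signal, xi)
        return np.concatenate([x1, x2], axis=1), y        # X has shape (n, 2d)

    rng = np.random.default_rng(0)
    d, n = 1000, 20
    mu = np.zeros(d); mu[0] = 1.0                         # ||mu||_2 = 1
    X, y = sample_data(n, mu, sigma_p=1.0 / np.sqrt(d), rng=rng)
    print(X.shape, y[:5])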

Intuitively, if a classifier learns the signal $\bm{\mu}$ and utilizes the signal patch of the data to make predictions, it can perfectly fit a given training data set $\{(\mathbf{x}_{i},y_{i}):i\in[n]\}$ and at the same time have good performance on the test data. However, when the dimension $d$ is large ($d>n$), a classifier that is a function of the noises $\bm{\xi}_{i}$, $i\in[n]$ can also perfectly fit the training data set, while its predictions will be totally random on new test data. Therefore, the data generation model given in Definition 3.1 is a useful model to study the population loss of overfitted classifiers. Similar models have been studied in some recent works by Li et al. (2019); Allen-Zhu and Li (2020a, b); Zou et al. (2021a).

Two-layer CNNs. We consider a two-layer convolutional neural network whose filters are applied to the two patches $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ separately, and the second layer parameters of the network are fixed as $+1/m$ and $-1/m$ respectively. Then the network can be written as $f(\mathbf{W},\mathbf{x})=F_{+1}(\mathbf{W}_{+1},\mathbf{x})-F_{-1}(\mathbf{W}_{-1},\mathbf{x})$, where $F_{+1}(\mathbf{W}_{+1},\mathbf{x})$, $F_{-1}(\mathbf{W}_{-1},\mathbf{x})$ are defined as

F_{j}(\mathbf{W}_{j},\mathbf{x})=\frac{1}{m}\sum_{r=1}^{m}\big[\sigma(\langle\mathbf{w}_{j,r},\mathbf{x}_{1}\rangle)+\sigma(\langle\mathbf{w}_{j,r},\mathbf{x}_{2}\rangle)\big]=\frac{1}{m}\sum_{r=1}^{m}\big[\sigma(\langle\mathbf{w}_{j,r},y\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}\rangle)\big]

for $j\in\{+1,-1\}$, where $m$ is the number of convolutional filters in $F_{+1}$ and $F_{-1}$. Here, $\sigma(z)=(\max\{0,z\})^{q}$ is the $\mathrm{ReLU}^{q}$ activation function with $q>2$, $\mathbf{w}_{j,r}\in\mathbb{R}^{d}$ denotes the weight for the $r$-th filter (i.e., neuron), and $\mathbf{W}_{j}$ is the collection of model weights associated with $F_{j}$. We also use $\mathbf{W}$ to denote the collection of all model weights. We note that our CNN model can also be viewed as a CNN with average global pooling (Lin et al., 2013). We train the above CNN model by minimizing the empirical cross-entropy loss function

L_{S}(\mathbf{W})=\frac{1}{n}\sum_{i=1}^{n}\ell[y_{i}\cdot f(\mathbf{W},\mathbf{x}_{i})],

where $\ell(z)=\log(1+\exp(-z))$, and $S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ is the training data set. We further define the true loss (test loss) $L_{\mathcal{D}}(\mathbf{W}):=\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\ell[y\cdot f(\mathbf{W},\mathbf{x})]$.
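As a concrete illustration, the following NumPy sketch evaluates the two-layer CNN $f(\mathbf{W},\mathbf{x})$ and the empirical cross-entropy loss $L_{S}(\mathbf{W})$ defined above. The array conventions (filters stored in an array of shape (2, m, d), index 0 for $j=+1$ and index 1 for $j=-1$) are our own, not notation from the paper.

    import numpy as np

    def relu_q(z, q=3):
        """Polynomial ReLU activation sigma(z) = max(0, z)^q."""
        return np.maximum(z, 0.0) ** q

    def cnn_forward(W, X, q=3):
        """Two-layer CNN f(W, x) = F_{+1}(W_{+1}, x) - F_{-1}(W_{-1}, x).

        W: array of shape (2, m, d); W[0] plays the role of W_{+1}, W[1] of W_{-1}.
        X: array of shape (n, 2d); each row is the concatenation [x_1, x_2].
        """
        m, d = W.shape[1], W.shape[2]
        x1, x2 = X[:, :d], X[:, d:]
        # F_j(W_j, x) = (1/m) sum_r [sigma(<w_{j,r}, x_1>) + sigma(<w_{j,r}, x_2>)]
        F = [(relu_q(x1 @ W[j].T, q) + relu_q(x2 @ W[j].T, q)).sum(axis=1) / m
             for j in range(2)]
        return F[0] - F[1]

    def train_loss(W, X, y, q=3):
        """Empirical cross-entropy loss L_S(W) = (1/n) sum_i log(1 + exp(-y_i f(W, x_i)))."""
        return np.mean(np.log1p(np.exp(-y * cnn_forward(W, X, q))))

With sample_data from the previous sketch, train_loss can be evaluated at a Gaussian initialization W = sigma_0 * rng.standard_normal((2, m, d)), matching the initialization described in the next paragraph.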

We consider gradient descent starting from Gaussian initialization, where each entry of $\mathbf{W}_{+1}$ and $\mathbf{W}_{-1}$ is sampled from a Gaussian distribution $N(0,\sigma_{0}^{2})$, and $\sigma_{0}^{2}$ is the variance. The gradient descent update of the filters in the CNN can be written as

\mathbf{w}_{j,r}^{(t+1)}=\mathbf{w}_{j,r}^{(t)}-\eta\cdot\nabla_{\mathbf{w}_{j,r}}L_{S}(\mathbf{W}^{(t)})=\mathbf{w}_{j,r}^{(t)}-\frac{\eta}{nm}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot jy_{i}\bm{\xi}_{i}-\frac{\eta}{nm}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\cdot j\bm{\mu} \qquad (3.1)

for $j\in\{\pm 1\}$ and $r\in[m]$, where we introduce the shorthand notation $\ell_{i}^{\prime(t)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]$.
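For completeness, here is a sketch of a single gradient descent step implementing the closed-form update (3.1) without automatic differentiation. It reuses cnn_forward from the previous sketch and the same array conventions; the helper names are ours, and the code only mirrors the update rule, it does not reproduce the paper's experiments.

    import numpy as np

    def drelu_q(z, q=3):
        """Derivative of sigma(z) = max(0, z)^q, i.e. sigma'(z) = q * max(0, z)^(q-1)."""
        return q * np.maximum(z, 0.0) ** (q - 1)

    def gradient_step(W, X, y, eta, q=3):
        """One gradient descent step on L_S(W), mirroring update (3.1).

        W has shape (2, m, d) with W[0] = W_{+1}, W[1] = W_{-1};
        X has shape (n, 2d) with rows [x_1, x_2].
        """
        n = X.shape[0]
        m, d = W.shape[1], W.shape[2]
        x1, x2 = X[:, :d], X[:, d:]
        margins = y * cnn_forward(W, X, q)              # y_i * f(W, x_i)
        ell_prime = -1.0 / (1.0 + np.exp(margins))      # l'(z) = -1 / (1 + e^z) < 0
        W_new = W.copy()
        for idx, j in enumerate([+1, -1]):
            # grad_{w_{j,r}} L_S = (1/(nm)) sum_i l'_i * y_i * j *
            #   [sigma'(<w_{j,r}, x_{i,1}>) x_{i,1} + sigma'(<w_{j,r}, x_{i,2}>) x_{i,2}],
            # which coincides with the two sums in (3.1) under the data model.
            coef = (ell_prime * y)[:, None] * j / (n * m)           # shape (n, 1)
            grad = (coef * drelu_q(x1 @ W[idx].T, q)).T @ x1 \
                 + (coef * drelu_q(x2 @ W[idx].T, q)).T @ x2        # shape (m, d)
            W_new[idx] = W[idx] - eta * grad
        return W_new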

4 Main Results

In this section, we present our main theoretical results. At the core of our analyses and results is a signal-noise decomposition of the filters in the CNN trained by gradient descent. By the gradient descent update rule (3.1), it is clear that the gradient descent iterate $\mathbf{w}_{j,r}^{(t)}$ is a linear combination of its random initialization $\mathbf{w}_{j,r}^{(0)}$, the signal vector $\bm{\mu}$ and the noise vectors in the training data $\bm{\xi}_{i}$, $i\in[n]$. Motivated by this observation, we introduce the following definition.

Definition 4.1.

Let $\mathbf{w}_{j,r}^{(t)}$ for $j\in\{\pm 1\}$, $r\in[m]$ be the convolution filters of the CNN at the $t$-th iteration of gradient descent. Then there exist unique coefficients $\gamma_{j,r}^{(t)}\geq 0$ and $\rho_{j,r,i}^{(t)}$ such that

\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(t)}\cdot\|\bm{\mu}\|_{2}^{-2}\cdot\bm{\mu}+\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\cdot\|\bm{\xi}_{i}\|_{2}^{-2}\cdot\bm{\xi}_{i}.

We further denote $\overline{\rho}_{j,r,i}^{(t)}:=\rho_{j,r,i}^{(t)}\operatorname{\mathds{1}}(\rho_{j,r,i}^{(t)}\geq 0)$ and $\underline{\rho}_{j,r,i}^{(t)}:=\rho_{j,r,i}^{(t)}\operatorname{\mathds{1}}(\rho_{j,r,i}^{(t)}\leq 0)$. Then we have that

\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(t)}\cdot\|\bm{\mu}\|_{2}^{-2}\cdot\bm{\mu}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(t)}\cdot\|\bm{\xi}_{i}\|_{2}^{-2}\cdot\bm{\xi}_{i}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(t)}\cdot\|\bm{\xi}_{i}\|_{2}^{-2}\cdot\bm{\xi}_{i}. \qquad (4.1)

We refer to (4.1) as the signal-noise decomposition of $\mathbf{w}_{j,r}^{(t)}$. We add the normalization factors $\|\bm{\mu}\|_{2}^{-2},\|\bm{\xi}_{i}\|_{2}^{-2}$ in the definition so that $\gamma_{j,r}^{(t)}\approx\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle$ and $\rho_{j,r,i}^{(t)}\approx\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle$. In this decomposition, $\gamma_{j,r}^{(t)}$ characterizes the progress of learning the signal vector $\bm{\mu}$, and $\rho_{j,r,i}^{(t)}$ characterizes the degree of noise memorization by the filter. Evidently, based on this decomposition, for some iteration $t$: (i) if some of the $\gamma_{j,r}^{(t)}$’s are large enough while the $|\rho_{j,r,i}^{(t)}|$’s are relatively small, then the CNN will have small training and test losses; (ii) if some $\overline{\rho}_{j,r,i}^{(t)}$’s are large and all $\gamma_{j,r}^{(t)}$’s are small, then the CNN will achieve a small training loss, but a large test loss. Thus, Definition 4.1 provides a handle for us to study the convergence of the training loss as well as the population loss of the CNN trained by gradient descent.
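Since $\bm{\mu}$ and $\bm{\xi}_{1},\ldots,\bm{\xi}_{n}$ are linearly independent with probability one, the coefficients in Definition 4.1 can be recovered from a trained filter by solving a small least-squares problem. The sketch below is our own diagnostic for doing so numerically (the sign $j$ is absorbed into the returned gamma); it is not part of the paper's proofs.

    import numpy as np

    def signal_noise_coefficients(w_t, w_0, mu, Xi):
        """Recover gamma and rho_i such that
        w_t = w_0 + gamma * mu / ||mu||^2 + sum_i rho_i * xi_i / ||xi_i||^2
        (the sign j of Definition 4.1 is absorbed into the returned gamma).

        w_t, w_0: filter at iteration t and at initialization, shape (d,).
        mu: signal vector, shape (d,).  Xi: noise vectors, shape (n, d).
        """
        # Basis whose columns are mu/||mu||^2 and xi_i/||xi_i||^2.
        basis = np.column_stack([mu / np.dot(mu, mu),
                                 (Xi / (Xi ** 2).sum(axis=1, keepdims=True)).T])
        coef, *_ = np.linalg.lstsq(basis, w_t - w_0, rcond=None)
        return coef[0], coef[1:]

Tracking gamma and max |rho_i| along a training run gives a direct empirical view of the signal learning versus noise memorization dynamics analyzed below.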

Our results are based on the following conditions on the dimension $d$, sample size $n$, neural network width $m$, learning rate $\eta$, and initialization scale $\sigma_{0}$.

Condition 4.2.

Suppose that

  1. Dimension $d$ is sufficiently large: $d=\widetilde{\Omega}(m^{2\vee[4/(q-2)]}n^{4\vee[(2q-2)/(q-2)]})$.

  2. Training sample size $n$ and neural network width $m$ satisfy $n,m=\Omega(\operatorname{polylog}(d))$.

  3. The learning rate $\eta$ satisfies $\eta\leq\widetilde{O}(\min\{\|\bm{\mu}\|_{2}^{-2},\sigma_{p}^{-2}d^{-1}\})$.

  4. The standard deviation of the Gaussian initialization $\sigma_{0}$ is appropriately chosen such that $\widetilde{O}(nd^{-1/2})\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\}\leq\sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-[1/(q-2)]\vee 1})\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\}$.

A few remarks on Condition 4.2 are in order. The condition on $d$ is to ensure that the learning is in a sufficiently over-parameterized setting, and similar conditions have been made in the study of learning over-parameterized linear models (Chatterji and Long, 2020; Cao et al., 2021). For example, if we choose $q=3$, then the condition on $d$ becomes $d=\widetilde{\Omega}(m^{4}n^{4})$. Furthermore, we require the sample size and neural network width to be at least polylogarithmic in the dimension $d$ to ensure that certain statistical properties of the training data and weight initialization hold with probability at least $1-d^{-1}$, which is a mild condition. Finally, the conditions on $\sigma_{0}$ and $\eta$ are to ensure that gradient descent can effectively minimize the training loss, and they depend on the scale of the training data points. When $\sigma_{p}=O(d^{-1/2})$ and $\|\bm{\mu}\|_{2}=O(1)$, the step size $\eta$ can be chosen as large as $\widetilde{O}(1)$ and the initialization $\sigma_{0}$ can be as large as $\widetilde{O}(m^{-2/(q-2)}n^{-[1/(q-2)]\vee 1})$. In our paper, we only require $m,n=\Omega(\operatorname{polylog}(d))$, so the initialization scale and step size can be chosen to be of nearly constant order. Based on these conditions, we give our main result on signal learning in the following theorem.
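As a quick illustration of how the requirement on $d$ scales with $q$, the following throwaway check (our own variable names, ignoring the hidden logarithmic factors) reproduces the $q=3$ example above.

    def dim_condition_exponents(q):
        """Exponents (a, b) such that the condition on d reads d = Omega~(m^a * n^b)."""
        a = max(2, 4 / (q - 2))
        b = max(4, (2 * q - 2) / (q - 2))
        return a, b

    for q in [3, 4, 6]:
        print(q, dim_condition_exponents(q))   # q = 3 gives (4.0, 4.0), i.e. d = Omega~(m^4 n^4)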

Theorem 4.3.

For any $\epsilon>0$, let $T=\widetilde{\Theta}(\eta^{-1}m\sigma_{0}^{-(q-2)}\|\bm{\mu}\|_{2}^{-q}+\eta^{-1}\epsilon^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2})$. Under Condition 4.2, if $n\cdot\mathrm{SNR}^{q}=\widetilde{\Omega}(1)$ (here the $\widetilde{\Omega}(\cdot)$ hides a $\operatorname{polylog}(\epsilon^{-1})$ factor; this applies to Theorem 4.4 as well), then with probability at least $1-d^{-1}$, there exists $0\leq t\leq T$ such that:

  1. The CNN learns the signal: $\max_{r}\gamma_{j,r}^{(t)}=\Omega(1)$ for $j\in\{\pm 1\}$.

  2. The CNN does not memorize the noises in the training data: $\max_{j,r,i}|\rho_{j,r,i}^{(T)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d})$.

  3. The training loss converges to $\epsilon$, i.e., $L_{S}(\mathbf{W}^{(t)})\leq\epsilon$.

  4. The trained CNN achieves a small test loss: $L_{\mathcal{D}}(\mathbf{W}^{(t)})\leq 6\epsilon+\exp(-n^{2})$.

Theorem 4.3 characterizes the case of signal learning. It shows that, if $n\cdot\mathrm{SNR}^{q}=\widetilde{\Omega}(1)$, then at least one CNN filter can learn the signal by achieving $\gamma_{j,r_{j}^{*}}^{(t)}\geq\Omega(1)$, and as a result, the learned neural network can achieve small training and test losses. To demonstrate the sharpness of this condition, we also present the following theorem for the noise memorization by the CNN.

Theorem 4.4.

For any $\epsilon>0$, let $T=\widetilde{\Theta}(\eta^{-1}m\cdot n(\sigma_{p}\sqrt{d})^{-q}\cdot\sigma_{0}^{-(q-2)}+\eta^{-1}\epsilon^{-1}nm^{3}d^{-1}\sigma_{p}^{-2})$. Under Condition 4.2, if $n^{-1}\cdot\mathrm{SNR}^{-q}=\widetilde{\Omega}(1)$, then with probability at least $1-d^{-1}$, there exists $0\leq t\leq T$ such that:

  1. The CNN memorizes noises in the training data: $\max_{r}\overline{\rho}_{y_{i},r,i}^{(t)}=\Omega(1)$.

  2. The CNN does not sufficiently learn the signal: $\max_{j,r}\gamma_{j,r}^{(t)}\leq\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})$.

  3. The training loss converges to $\epsilon$, i.e., $L_{S}(\mathbf{W}^{(t)})\leq\epsilon$.

  4. The trained CNN has a constant order test loss: $L_{\mathcal{D}}(\mathbf{W}^{(t)})=\Theta(1)$.

Figure 1: Illustration of the phase transition between benign and harmful overfitting. The blue region represents the setting under which the overfitted CNN trained by gradient descent is guaranteed to have small population loss, and the yellow region represents the setting under which the population loss is guaranteed to be of constant order. The slim gray band region is the setting where the population loss is not well characterized.

Theorem 4.4 holds under the condition that $n^{-1}\cdot\mathrm{SNR}^{-q}=\widetilde{\Omega}(1)$. Clearly, this is the opposite regime (up to some logarithmic factors) compared with Theorem 4.3. In this case, the CNN trained by gradient descent mainly memorizes noises in the training data and does not learn enough signal. This, together with the results in Theorem 4.3, reveals a clear phase transition between signal learning and noise memorization in CNN training:

  • If $n\cdot\mathrm{SNR}^{q}=\widetilde{\Omega}(1)$, then the CNN learns the signal and achieves an $O(\epsilon+\exp(-n^{2}))$ test loss. This is the regime of benign overfitting.

  • If $n^{-1}\cdot\mathrm{SNR}^{-q}=\widetilde{\Omega}(1)$, then the CNN can only memorize noises and will have a $\Theta(1)$ test loss. This is the regime of harmful overfitting.

The phase transition is illustrated in Figure 1. Clearly, when learning a two-layer CNN on the data generated from Definition 3.1, $n\cdot\mathrm{SNR}^{q}=\widetilde{\Omega}(1)$ is the precise condition under which benign overfitting occurs. Remarkably, in this case the population loss decreases exponentially with the sample size $n$. Under our condition that $n=\Omega(\operatorname{polylog}(d))$, this term can also be upper bounded by $1/\mathrm{poly}(d)$, which is small in the high-dimensional setting. Note that when $\|\bm{\mu}\|_{2}=\Theta(1)$ and $\sigma_{p}=\Theta(d^{-1/2})$, applying standard uniform convergence based bounds (Bartlett et al., 2017; Neyshabur et al., 2018) or stability based bounds (Hardt et al., 2016; Mou et al., 2017; Chen et al., 2018) typically gives a $\widetilde{O}(n^{-1/2})$ bound on the generalization gap, which is vacuous when $n=O(\operatorname{polylog}(d))$. Our bound under the same setting is $O(1/\mathrm{poly}(d))$, which is non-vacuous. This is attributed to our precise analysis of signal learning and noise memorization in Theorems 4.3 and 4.4.
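Up to the logarithmic factors hidden in $\widetilde{\Omega}(\cdot)$, the two regimes are separated by the single quantity $n\cdot\mathrm{SNR}^{q}$. The snippet below evaluates it for a few illustrative configurations; the threshold of $1$ used here ignores constants and log factors and is only meant to indicate which side of the phase transition a configuration falls on.

    import numpy as np

    def snr_regime(n, mu_norm, sigma_p, d, q=3):
        """Classify a configuration by n * SNR^q with SNR = ||mu||_2 / (sigma_p * sqrt(d)).

        The constants and log factors hidden by the Omega~(.) conditions are ignored,
        so the threshold of 1 is only indicative.
        """
        key = n * (mu_norm / (sigma_p * np.sqrt(d))) ** q
        return key, ("benign overfitting" if key >= 1 else "harmful overfitting")

    d = 1000
    for n, mu_norm in [(20, 1.0), (20, 0.2), (500, 0.2)]:
        print(n, mu_norm, snr_regime(n, mu_norm, sigma_p=d ** -0.5, d=d))

Note how, with the signal strength held fixed, increasing the sample size $n$ can move a configuration from the harmful to the benign regime, consistent with the discussion above.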

Comparison with neural tangent kernel (NTK) results. We want to emphasize that our analysis is beyond the so-called neural tangent kernel regime. In the NTK regime, it has been shown that gradient descent can train an over-parameterized neural network to achieve good training and test accuracies (Jacot et al., 2018; Du et al., 2019b, a; Allen-Zhu et al., 2019b; Zou et al., 2019; Arora et al., 2019a; Cao and Gu, 2019a; Chen et al., 2019). However, it is widely believed in the literature that the NTK analyses cannot fully explain the success of deep learning, as the neural networks in the NTK regime are almost “linearized” (Lee et al., 2019; Cao and Gu, 2019a). Our analysis and results are not in the NTK regime: in the NTK regime, the network parameters stay close to their initialization throughout training, i.e., $\|\mathbf{W}^{(t)}-\mathbf{W}^{(0)}\|_{F}=O(1)$, so that the NN model can be approximated by its linearization (Allen-Zhu et al., 2019b; Cao and Gu, 2019a; Chen et al., 2019). In comparison, our analysis does not rely on linearizing the neural network function, and $\|\mathbf{W}^{(t)}-\mathbf{W}^{(0)}\|_{F}$ can be as large as $O(\mathrm{poly}(m))$.

5 Overview of Proof Technique

In this section, we discuss the main challenges in the study of CNN training under our setting, and explain some key techniques we implement in our proofs to overcome these challenges. The complete proofs of all the results are given in the appendix.

Main challenges. Studying benign overfitting under our setting is a challenging task. The first challenge is the nonconvexity of the training objective function $L_{S}(\mathbf{W})$. Nonconvexity introduces new difficulties in the study of benign overfitting, particularly because our goal is not only to show the convergence of the training loss, but also to study the population loss in the over-parameterized setting, which requires a precise algorithmic analysis of the learning problem.

5.1 Iterative Analysis of the Signal-Noise Decomposition

In order to study the learning process based on the nonconvex optimization problem, we propose a key technique which enables the iterative analysis of the coefficients in the signal-noise decomposition in Definition 4.1. This technique is given in the following lemma.

Lemma 5.1.

The coefficients $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},\underline{\rho}_{j,r,i}^{(t)}$ in Definition 4.1 satisfy the following equations:

\gamma_{j,r}^{(0)},\overline{\rho}_{j,r,i}^{(0)},\underline{\rho}_{j,r,i}^{(0)}=0, \qquad (5.1)

\gamma_{j,r}^{(t+1)}=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\cdot\|\bm{\mu}\|_{2}^{2}, \qquad (5.2)

\overline{\rho}_{j,r,i}^{(t+1)}=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=j), \qquad (5.3)

\underline{\rho}_{j,r,i}^{(t+1)}=\underline{\rho}_{j,r,i}^{(t)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=-j). \qquad (5.4)
Remark 5.2.

With the decomposition (4.1), the signal learning and noise memorization processes of a CNN can be formally studied by analyzing the dynamics of $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},\underline{\rho}_{j,r,i}^{(t)}$ based on the dynamical system (5.2)-(5.4). Note that prior to our work, several existing results have utilized the inner products $\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle$ during the neural network training process in order to establish generalization bounds (Brutzkus et al., 2018; Chatterji and Long, 2020; Frei et al., 2021). Similar inner product based arguments are also implemented in Allen-Zhu and Li (2020a, b); Zou et al. (2021a), which study different topics related to learning neural networks. Compared with the inner product based argument, our method has two major advantages: (i) Based on the definitions (5.2)-(5.4) and the fact that $\ell_{i}^{\prime(t)}<0$, it is clear that $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}$ are monotonically increasing, while $\underline{\rho}_{j,r,i}^{(t)}$ is monotonically decreasing throughout the whole training process. In comparison, monotonicity does not hold in the inner product based argument, especially for $\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle$. (ii) Our signal-noise decomposition also enables a clean homogeneity-based proof for the convergence of the training loss to an arbitrarily small error rate $\epsilon>0$, which will be presented in Subsection 5.2.

With Lemma 5.1, we can reduce the study of the CNN learning process to the analysis of the discrete dynamical system given by (5.1)-(5.4). Our proof then focuses on a careful assessment of the values of the coefficients $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},\underline{\rho}_{j,r,i}^{(t)}$ throughout training. To prepare for more detailed analyses, we first present the following bounds on the coefficients, which hold throughout training.
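To make this reduction concrete, the following sketch iterates the coefficient dynamics (5.1)-(5.4) on a small synthetic instance, tracking the signed coefficient $\rho_{j,r,i}^{(t)}$ (the sum of the split coefficients in (5.3) and (5.4)) and reconstructing $\mathbf{w}_{j,r}^{(t)}$ from the decomposition (4.1) at every step to evaluate the required inner products. It reuses sample_data, cnn_forward and drelu_q from the earlier sketches, and all parameter values are arbitrary illustrative choices rather than settings from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, m, q = 500, 10, 5, 3
    sigma_p, sigma_0, eta, T = 1.0 / np.sqrt(d), 0.05, 0.5, 2000
    mu = np.zeros(d); mu[0] = 1.0
    X, y = sample_data(n, mu, sigma_p, rng)
    x1, x2 = X[:, :d], X[:, d:]
    # The noise patch xi_i is the patch orthogonal to mu.
    Xi = np.where((np.abs(x1 @ mu) > 1e-12)[:, None], x2, x1)

    W0 = sigma_0 * rng.standard_normal((2, m, d))        # w_{j,r}^{(0)}
    gamma = np.zeros((2, m))                             # gamma_{j,r}^{(t)}
    rho = np.zeros((2, m, n))                            # signed rho_{j,r,i}^{(t)}
    js = np.array([+1.0, -1.0])

    for t in range(T):
        # Reconstruct w_{j,r}^{(t)} from the signal-noise decomposition (4.1).
        W = (W0
             + js[:, None, None] * gamma[:, :, None] * mu[None, None, :] / np.dot(mu, mu)
             + rho @ (Xi / (Xi ** 2).sum(axis=1, keepdims=True)))
        ell_prime = -1.0 / (1.0 + np.exp(y * cnn_forward(W, X, q)))   # l'_i^{(t)} < 0
        for jdx, j in enumerate([+1, -1]):
            # (5.2): signal coefficient update.
            sig_act = drelu_q((W[jdx] @ mu)[:, None] * y[None, :], q)  # sigma'(<w, y_i mu>)
            gamma[jdx] -= eta / (n * m) * (sig_act * ell_prime).sum(axis=1) * np.dot(mu, mu)
            # (5.3)/(5.4) combined: noise coefficient update, signed by j * y_i.
            noise_act = drelu_q(W[jdx] @ Xi.T, q)                      # sigma'(<w, xi_i>)
            rho[jdx] -= (eta / (n * m) * noise_act * ell_prime[None, :]
                         * (Xi ** 2).sum(axis=1)[None, :] * (j * y)[None, :])
    print("max gamma:", gamma.max(), "max |rho|:", np.abs(rho).max())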

Proposition 5.3.

Under Condition 4.2, for any $T^{*}=\eta^{-1}\mathrm{poly}(\epsilon^{-1},\|\bm{\mu}\|_{2}^{-1},d^{-1}\sigma_{p}^{-2},\sigma_{0}^{-1},n,m,d)$, the following bounds hold for $t\in[0,T^{*}]$:

  • $0\leq\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}\leq 4\log(T^{*})$ for all $j\in\{\pm 1\}$, $r\in[m]$ and $i\in[n]$.

  • $0\geq\underline{\rho}_{j,r,i}^{(t)}\geq-2\max_{i,j,r}\{|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\}-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\cdot 4\log(T^{*})$ for all $j\in\{\pm 1\}$, $r\in[m]$ and $i\in[n]$.

We can then prove the following lemma, which demonstrates that the training objective function $L_{S}(\mathbf{W})$ can dominate the gradient norm $\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}$ along the gradient descent path.

Lemma 5.4.

Under Condition 4.2, for any $T^{*}=\eta^{-1}\mathrm{poly}(\epsilon^{-1},\|\bm{\mu}\|_{2}^{-1},d^{-1}\sigma_{p}^{-2},\sigma_{0}^{-1},n,m,d)$, the following result holds for $t\in[0,T^{*}]$:

\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}=O\big(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\}\big)\cdot L_{S}(\mathbf{W}^{(t)}).

Lemma 5.4 plays a key role in the convergence proof for the training loss. However, note that our study of benign overfitting requires carefully monitoring the changes of the coefficients in the signal-noise decomposition, which cannot be done directly by Lemma 5.4. This is quite a challenging task, due to the complicated interactions among $\gamma_{j,r}^{(t)}$, $\overline{\rho}_{j,r,i}^{(t)}$ and $\underline{\rho}_{j,r,i}^{(t)}$. Note that even $\gamma_{j,r}^{(t)}$, which has the simplest formula (5.2), depends on all the quantities $\gamma_{j^{\prime},r^{\prime}}^{(t)}$, $\overline{\rho}_{j^{\prime},r^{\prime},i}^{(t)}$ and $\underline{\rho}_{j^{\prime},r^{\prime},i}^{(t)}$ for $j^{\prime}\in\{\pm 1\}$, $r^{\prime}\in[m]$ and $i\in[n]$. This is because the cross-entropy loss derivative term $\ell_{i}^{\prime(t)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]$ depends on all the neurons of the network. To overcome this challenge, we introduce in the next subsection a decoupling technique based on a two-stage analysis.

5.2 Decoupling with a Two-Stage Analysis.

We utilize a two-stage analysis to decouple the complicated relations among the coefficients $\gamma_{j,r}^{(t)}$, $\overline{\rho}_{j,r,i}^{(t)}$ and $\underline{\rho}_{j,r,i}^{(t)}$. Intuitively, the initial neural network weights are small enough so that the neural network at initialization has constant level cross-entropy loss derivatives on all the training data: $\ell_{i}^{\prime(0)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(0)},\mathbf{x}_{i})]=\Theta(1)$ for all $i\in[n]$. This is guaranteed under Condition 4.2 and matches neural network training in practice. Motivated by this, we consider a first stage of the training process in which $\ell_{i}^{\prime(t)}=\Theta(1)$, where we can show significant scale differences among $\gamma_{j,r}^{(t)}$, $\overline{\rho}_{j,r,i}^{(t)}$ and $\underline{\rho}_{j,r,i}^{(t)}$. Based on the results of the first stage, we then proceed to the second stage of the training process, where the loss derivatives are no longer at a constant level, and show that the training loss can be optimized to be arbitrarily small while the scale differences established in the first stage remain unchanged throughout training. In the following, we focus on explaining the key proof steps for Theorem 4.3. The proof idea for Theorem 4.4 is similar, so we defer the details to the appendix.

Stage 1. It can be shown that, until some of the coefficients $\gamma_{j,r}^{(t)}$, $\rho_{j,r,i}^{(t)}$ reach $\Theta(1)$, we have $\ell_{i}^{\prime(t)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]=\Theta(1)$ for all $i\in[n]$. Therefore, we first focus on this stage of the training process, where the dynamics of the coefficients in (5.2)-(5.4) can be greatly simplified by replacing the $\ell_{i}^{\prime(t)}$ factors with their constant upper and lower bounds. The following lemma summarizes our main conclusion for stage 1 of signal learning:

Lemma 5.5.

Under the same conditions as Theorem 4.3, there exists $T_{1}=\widetilde{O}(\eta^{-1}m\sigma_{0}^{2-q}\|\bm{\mu}\|_{2}^{-q})$ such that

  • $\max_{r}\gamma_{j,r}^{(T_{1})}=\Omega(1)$ for $j\in\{\pm 1\}$.

  • $|\rho_{j,r,i}^{(t)}|=O(\sigma_{0}\sigma_{p}\sqrt{d})$ for all $j\in\{\pm 1\}$, $r\in[m]$, $i\in[n]$ and $0\leq t\leq T_{1}$.

Lemma 5.5 takes advantage of the training period during which the loss derivatives remain of constant order to show that the CNN can capture the signal. At the end of stage 1 of signal learning, $\max_{r}\gamma_{j,r}$ reaches $\Theta(1)$ and is significantly larger than $\rho_{j,r,i}^{(t)}$. After this point, it is no longer guaranteed that the loss derivatives $\ell_{i}^{\prime(t)}$ remain of constant order, and training stage 2 starts.

Stage 2. In this stage, we take into full consideration the exact definition $\ell_{i}^{\prime(t)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]$ and show that the training loss converges below any target accuracy, i.e., $L_{S}(\mathbf{W}^{(t)})<\epsilon$. Thanks to the analysis in stage 1, we know that some $\gamma_{j,r}^{(t)}$ is significantly larger than all the $\rho_{j,r,i}^{(t)}$’s at the end of stage 1. This scale difference is the key to our analysis in stage 2. Based on this scale difference and the monotonicity of $\gamma_{j,r}^{(t)}$, $\overline{\rho}_{j,r,i}^{(t)}$, $\underline{\rho}_{j,r,i}^{(t)}$ in the signal-noise decomposition, it can be shown that there exists $\mathbf{W}^{*}$ such that $y_{i}\cdot\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle\geq q\log(2q/\epsilon)$ throughout stage 2. Moreover, since the neural network $f(\mathbf{W},\mathbf{x})$ is $q$-homogeneous in $\mathbf{W}$, we have $\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}),\mathbf{W}^{(t)}\rangle=q\cdot f(\mathbf{W}^{(t)},\mathbf{x})$. Therefore,

\begin{align*}
\langle\nabla L_{S}(\mathbf{W}^{(t)}),\mathbf{W}^{(t)}-\mathbf{W}^{*}\rangle&=\frac{1}{n}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot y_{i}\cdot\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{(t)}-\mathbf{W}^{*}\rangle\\
&=\frac{1}{n}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot[y_{i}\cdot q\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})-y_{i}\cdot\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle]\\
&\geq\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]\cdot[y_{i}\cdot q\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})-q\log(2q/\epsilon)]\\
&\geq q\cdot\frac{1}{n}\sum_{i=1}^{n}[\ell(y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i}))-\ell(\log(2q/\epsilon))]\\
&\geq q\cdot L_{S}(\mathbf{W}^{(t)})-\epsilon/2,
\end{align*}

where the second inequality follows by the convexity of the cross-entropy loss function. With the above key technique, we can prove the following lemma.
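The $q$-homogeneity identity $\langle\nabla f(\mathbf{W},\mathbf{x}),\mathbf{W}\rangle=q\cdot f(\mathbf{W},\mathbf{x})$ used in the derivation above can be sanity-checked numerically with a finite-difference gradient. The short script below (our own test harness with arbitrary sizes, reusing cnn_forward from the sketch in Section 3) does exactly this.

    import numpy as np

    def numerical_grad(f, W, eps=1e-6):
        """Central finite-difference gradient of a scalar function f at W."""
        grad = np.zeros_like(W)
        it = np.nditer(W, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            Wp, Wm = W.copy(), W.copy()
            Wp[idx] += eps
            Wm[idx] -= eps
            grad[idx] = (f(Wp) - f(Wm)) / (2 * eps)
        return grad

    rng = np.random.default_rng(1)
    d, m, q = 8, 3, 3
    x = rng.standard_normal(2 * d)
    W = 0.5 * rng.standard_normal((2, m, d))
    f_val = cnn_forward(W, x[None, :], q)[0]
    grad = numerical_grad(lambda V: cnn_forward(V, x[None, :], q)[0], W)
    # Euler's identity for q-homogeneous functions: <grad f, W> = q * f.
    print(np.isclose((grad * W).sum(), q * f_val, rtol=1e-3, atol=1e-6))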

Lemma 5.6.

Let $T,T_{1}$ be defined in Theorem 4.3 and Lemma 5.5 respectively. Then under the same conditions as Theorem 4.3, for any $t\in[T_{1},T]$, it holds that $|\rho_{j,r,i}^{(t)}|\leq\sigma_{0}\sigma_{p}\sqrt{d}$ for all $j\in\{\pm 1\}$, $r\in[m]$ and $i\in[n]$. Moreover, let $\mathbf{W}^{*}$ be the collection of CNN parameters with convolution filters $\mathbf{w}^{*}_{j,r}=\mathbf{w}_{j,r}^{(0)}+2qm\log(2q/\epsilon)\cdot j\cdot\|\bm{\mu}\|_{2}^{-2}\cdot\bm{\mu}$. Then the following bound holds:

\frac{1}{t-T_{1}+1}\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(t-T_{1}+1)}+\frac{\epsilon}{2q-1}

for all $t\in[T_{1},T]$, where we denote $\|\mathbf{W}\|_{F}=\sqrt{\|\mathbf{W}_{+1}\|_{F}^{2}+\|\mathbf{W}_{-1}\|_{F}^{2}}$.

Lemma 5.6 states two main results on signal learning. First, during this training period, it is guaranteed that the coefficients of the noise vectors, $\rho_{j,r,i}^{(t)}$, in the signal-noise decomposition remain sufficiently small. Moreover, it also gives an optimization-type result: the average training loss over $[T_{1},T]$, and hence the training loss of the best iterate in $[T_{1},T]$, is small as long as $T$ is large enough.

Clearly, the convergence of the training loss stated in Theorem 4.3 directly follows by choosing $T$ to be sufficiently large in Lemma 5.6. The lemma below further gives an upper bound on the test loss.

Lemma 5.7.

Let $T$ be defined in Theorem 4.3. Under the same conditions as Theorem 4.3, for any $t\leq T$ with $L_{S}(\mathbf{W}^{(t)})\leq 1$, it holds that $L_{\mathcal{D}}(\mathbf{W}^{(t)})\leq 6\cdot L_{S}(\mathbf{W}^{(t)})+\exp(-n^{2})$.

Below we finalize the proof of Theorem 4.3. The proofs of other results are in the appendix.

Proof of Theorem 4.3.

The first part of Theorem 4.3 follows by Lemma 5.5 and the monotonicity of $\gamma_{j,r}^{(t)}$. The second part of Theorem 4.3 follows by Lemma 5.6. For the third part, let $\mathbf{W}^{*}$ be defined in Lemma 5.6. Then by the definition of $\mathbf{W}^{*}$, we have

\begin{align*}
\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}&\leq\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{(0)}\|_{F}+\|\mathbf{W}^{(0)}-\mathbf{W}^{*}\|_{F}\\
&\leq\sum_{j,r}\gamma_{j,r}^{(T_{1})}\|\bm{\mu}\|_{2}^{-1}+\sum_{j,r,i}\frac{\overline{\rho}_{j,r,i}^{(T_{1})}}{\|\bm{\xi}_{i}\|_{2}}+\sum_{j,r,i}\frac{\underline{\rho}_{j,r,i}^{(T_{1})}}{\|\bm{\xi}_{i}\|_{2}}+\Theta(m^{3/2}\log(1/\epsilon))\|\bm{\mu}\|_{2}^{-1}\\
&=\widetilde{O}(m^{3/2}\|\bm{\mu}\|_{2}^{-1}),
\end{align*}

where the first inequality is by the triangle inequality, the second inequality is by the signal-noise decomposition of $\mathbf{W}^{(T_{1})}$ and the definition of $\mathbf{W}^{*}$, and the last equality is by Proposition 5.3 and Lemma 5.5. Therefore, choosing $T=\widetilde{\Theta}(\eta^{-1}T_{1}+\eta^{-1}\epsilon^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2})=\widetilde{\Theta}(\eta^{-1}\sigma_{0}^{-(q-2)}\|\bm{\mu}\|_{2}^{-q}+\eta^{-1}\epsilon^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2})$ in Lemma 5.6 ensures that

\frac{1}{T-T_{1}+1}\sum_{t=T_{1}}^{T}L_{S}(\mathbf{W}^{(t)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(T-T_{1}+1)}+\frac{\epsilon}{2q-1}\leq\frac{\widetilde{O}(m^{3}\|\bm{\mu}\|_{2}^{-2})}{(2q-1)\eta(T-T_{1}+1)}+\frac{\epsilon}{2q-1}\leq\epsilon,

and there exists $t\in[T_{1},T]$ such that $L_{S}(\mathbf{W}^{(t)})\leq\epsilon$. This completes the proof of the third part of Theorem 4.3. Finally, combining this bound with Lemma 5.7 gives

L_{\mathcal{D}}(\mathbf{W}^{(t)})\leq 6\cdot L_{S}(\mathbf{W}^{(t)})+\exp(-n^{2})\leq 6\epsilon+\exp(-n^{2}),

which proves the last part of Theorem 4.3. ∎

6 Conclusion and Future Work

This paper utilizes a signal-noise decomposition to study the signal learning and noise memorization processes in the training of a two-layer CNN. We precisely give the conditions under which the CNN will mainly focus on learning signals or on memorizing noises, and reveal a phase transition of the population loss with respect to the sample size, signal strength, noise level, and dimension. Our results theoretically demonstrate that benign overfitting can happen in neural network training. An important future direction is to study the benign overfitting phenomenon of neural networks in learning other data models. Moreover, it is also important to generalize our analysis to deep convolutional neural networks.

Acknowledgements

We would like to thank Spencer Frei for valuable comments and discussions on an earlier version of this paper, and for pointing out a related work.

Appendix A Additional Related Work

There have also been a large number of works studying the optimization and generalization of neural networks. A series of works (Li and Yuan, 2017; Soltanolkotabi, 2017; Du et al., 2018a, b; Zhong et al., 2017; Zhang et al., 2019; Cao and Gu, 2019b) studied the parameter recovery problem in two-layer neural networks, where the data are given by a teacher network and the task is to recover the parameters of the teacher network. These works either focus on the noiseless setting, or require the number of training data points to be larger than the number of parameters in the network, and therefore do not cover the setting where the neural network can overfit the training data. Another line of works (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2018; Golowich et al., 2018; Arora et al., 2018) studied the generalization gap between the training and test losses of neural networks with uniform convergence based arguments. However, these results are not algorithm-dependent and cannot explain benign overfitting. Some recent works studied the generalization gap based on stability arguments (Bousquet and Elisseeff, 2002; Hardt et al., 2016; Mou et al., 2017; Chen et al., 2018). A more recent line of works studied the convergence (Jacot et al., 2018; Li and Liang, 2018; Du et al., 2019b; Allen-Zhu et al., 2019b; Du et al., 2019a; Zou et al., 2019) and test error bounds (Allen-Zhu et al., 2019a; Arora et al., 2019a, b; Cao and Gu, 2019a; Ji and Telgarsky, 2020; Chen et al., 2019) of over-parameterized networks in the neural tangent kernel regime. However, these works depend on the equivalence between neural network training and kernel methods, which cannot fully explain the success of deep learning. Compared with the works mentioned above, our work has a different focus, which is to study the conditions for benign and harmful overfitting.

Appendix B Preliminary Lemmas

In this section, we present some pivotal lemmas that give some important properties of the data and the neural network parameters at their random initialization.

Lemma B.1.

Suppose that $\delta>0$ and $n\geq 8\log(4/\delta)$. Then with probability at least $1-\delta$,

|\{i\in[n]:y_{i}=1\}|,~|\{i\in[n]:y_{i}=-1\}|\geq n/4.
Proof of Lemma B.1.

By Hoeffding’s inequality, with probability at least $1-\delta/2$, we have

\bigg|\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathds{1}}\{y_{i}=1\}-\frac{1}{2}\bigg|\leq\sqrt{\frac{\log(4/\delta)}{2n}}.

Therefore, as long as $n\geq 8\log(4/\delta)$, we have

|\{i\in[n]:y_{i}=1\}|=\sum_{i=1}^{n}\operatorname{\mathds{1}}\{y_{i}=1\}\geq\frac{n}{2}-n\cdot\sqrt{\frac{\log(4/\delta)}{2n}}\geq\frac{n}{4}.

This proves the result for $|\{i\in[n]:y_{i}=1\}|$. The proof for $|\{i\in[n]:y_{i}=-1\}|$ is exactly the same, and we can conclude the proof by applying a union bound. ∎

The following lemma estimates the norms of the noise vectors $\bm{\xi}_{i}$, $i\in[n]$, and gives an upper bound of their inner products with each other.

Lemma B.2.

Suppose that $\delta>0$ and $d=\Omega(\log(4n/\delta))$. Then with probability at least $1-\delta$,

\sigma_{p}^{2}d/2\leq\|\bm{\xi}_{i}\|_{2}^{2}\leq 3\sigma_{p}^{2}d/2,
|\langle\bm{\xi}_{i},\bm{\xi}_{i^{\prime}}\rangle|\leq 2\sigma_{p}^{2}\cdot\sqrt{d\log(4n^{2}/\delta)}

for all $i,i^{\prime}\in[n]$ with $i\neq i^{\prime}$ in the second bound.

Proof of Lemma B.2.

By Bernstein’s inequality, with probability at least $1-\delta/(2n)$ we have

\big|\|\bm{\xi}_{i}\|_{2}^{2}-\sigma_{p}^{2}d\big|=O(\sigma_{p}^{2}\cdot\sqrt{d\log(4n/\delta)}).

Therefore, as long as $d=\Omega(\log(4n/\delta))$, we have

\sigma_{p}^{2}d/2\leq\|\bm{\xi}_{i}\|_{2}^{2}\leq 3\sigma_{p}^{2}d/2.

Moreover, $\langle\bm{\xi}_{i},\bm{\xi}_{i^{\prime}}\rangle$ clearly has mean zero. For any $i,i^{\prime}$ with $i\neq i^{\prime}$, by Bernstein’s inequality, with probability at least $1-\delta/(2n^{2})$ we have

|\langle\bm{\xi}_{i},\bm{\xi}_{i^{\prime}}\rangle|\leq 2\sigma_{p}^{2}\cdot\sqrt{d\log(4n^{2}/\delta)}.

Applying a union bound completes the proof. ∎
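Both bounds in Lemma B.2 are easy to check empirically. The following short Monte Carlo (our own, with arbitrary choices of $d$, $n$, $\sigma_p$ and $\delta$) draws noise vectors as in Definition 3.1 and verifies that the observed norms and pairwise inner products fall inside the stated ranges; both printed values should be True.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, sigma_p, delta = 2000, 50, 0.3, 0.01
    mu = np.zeros(d); mu[0] = 1.0
    mu_unit = mu / np.linalg.norm(mu)
    # Noise with covariance sigma_p^2 * (I - mu mu^T / ||mu||_2^2).
    Xi = sigma_p * rng.standard_normal((n, d))
    Xi -= np.outer(Xi @ mu_unit, mu_unit)

    norms_sq = (Xi ** 2).sum(axis=1)
    print(norms_sq.min() >= sigma_p ** 2 * d / 2 and norms_sq.max() <= 1.5 * sigma_p ** 2 * d)

    G = Xi @ Xi.T
    off_diag = np.abs(G[~np.eye(n, dtype=bool)])
    print(off_diag.max() <= 2 * sigma_p ** 2 * np.sqrt(d * np.log(4 * n ** 2 / delta)))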

The following lemma studies the inner product between a randomly initialized CNN convolutional filter $\mathbf{w}_{j,r}^{(0)}$, $j\in\{+1,-1\}$ and $r\in[m]$, and the signal/noise vectors in the training data. The calculations characterize how the neural network at initialization randomly captures signal and noise information.

Lemma B.3.

Suppose that $d\geq\Omega(\log(mn/\delta))$ and $m=\Omega(\log(1/\delta))$. Then with probability at least $1-\delta$,

|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|\leq\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2},
|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\leq 2\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}

for all $r\in[m]$, $j\in\{\pm 1\}$ and $i\in[n]$. Moreover,

\sigma_{0}\|\bm{\mu}\|_{2}/2\leq\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\leq\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2},
\sigma_{0}\sigma_{p}\sqrt{d}/4\leq\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle\leq 2\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}

for all $j\in\{\pm 1\}$ and $i\in[n]$.

Proof of Lemma B.3.

It is clear that for each $r\in[m]$, $j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle$ is a Gaussian random variable with mean zero and variance $\sigma_{0}^{2}\|\bm{\mu}\|_{2}^{2}$. Therefore, by the Gaussian tail bound and a union bound, with probability at least $1-\delta/4$,

j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\leq|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|\leq\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2}.

Moreover, $\mathbb{P}(\sigma_{0}\|\bm{\mu}\|_{2}/2>j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle)$ is an absolute constant, and therefore by the condition on $m$, we have

\begin{align*}
\mathbb{P}\big(\sigma_{0}\|\bm{\mu}\|_{2}/2\leq\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\big)&=1-\mathbb{P}\big(\sigma_{0}\|\bm{\mu}\|_{2}/2>\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\big)\\
&=1-\mathbb{P}\big(\sigma_{0}\|\bm{\mu}\|_{2}/2>j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\big)^{2m}\\
&\geq 1-\delta/4.
\end{align*}

By Lemma B.2, with probability at least $1-\delta/4$, $\sigma_{p}\sqrt{d}/\sqrt{2}\leq\|\bm{\xi}_{i}\|_{2}\leq\sqrt{3/2}\cdot\sigma_{p}\sqrt{d}$ for all $i\in[n]$. Therefore, the result for $\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle$ follows from the same proof as that for $j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle$. ∎
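As with Lemma B.2, the initialization bounds can be verified by simulation. The sketch below (our own, with arbitrary sizes) checks that $\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle$ falls in the stated range for a Gaussian initialization; both printed values should be True.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m, sigma_0, delta = 2000, 64, 0.01, 0.01
    mu = np.zeros(d); mu[0] = 1.0                     # ||mu||_2 = 1
    mu_norm = np.linalg.norm(mu)

    W0 = sigma_0 * rng.standard_normal((2, m, d))     # filters w_{j,r}^{(0)}
    for j_idx, j in enumerate([+1, -1]):
        max_inner = np.max(j * (W0[j_idx] @ mu))
        lower = sigma_0 * mu_norm / 2
        upper = np.sqrt(2 * np.log(8 * m / delta)) * sigma_0 * mu_norm
        print(j, lower <= max_inner <= upper)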

Appendix C Signal-noise Decomposition Analysis

In this section, we establish a series of results on the signal-noise decomposition. These results are based on the conclusions in Section B, which hold with high probability. Denote by $\mathcal{E}_{\mathrm{prelim}}$ the event that all the results in Section B hold. Then for simplicity and clarity, we state all the results in this and the following sections conditional on $\mathcal{E}_{\mathrm{prelim}}$.

Lemma C.1 (Restatement of Lemma 5.1).

The coefficients $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},\underline{\rho}_{j,r,i}^{(t)}$ defined in Definition 4.1 satisfy the following iterative equations:

\gamma_{j,r}^{(0)},\overline{\rho}_{j,r,i}^{(0)},\underline{\rho}_{j,r,i}^{(0)}=0,
\gamma_{j,r}^{(t+1)}=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\cdot\|\bm{\mu}\|_{2}^{2},
\overline{\rho}_{j,r,i}^{(t+1)}=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=j),
\underline{\rho}_{j,r,i}^{(t+1)}=\underline{\rho}_{j,r,i}^{(t)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=-j)

for all $r\in[m]$, $j\in\{\pm 1\}$ and $i\in[n]$.

Proof of Lemma C.1.

By our data model in Definition 3.1 and the Gaussian initialization of the CNN weights, it is clear that with probability $1$, the vectors $\bm{\mu},\bm{\xi}_{1},\ldots,\bm{\xi}_{n}$ are linearly independent. Therefore, the decomposition (4.1) is unique. Now consider $\widetilde{\gamma}_{j,r}^{(0)},\widetilde{\rho}_{j,r,i}^{(0)}=0$ and

\widetilde{\gamma}_{j,r}^{(t+1)}=\widetilde{\gamma}_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\cdot\|\bm{\mu}\|_{2}^{2},
\widetilde{\rho}_{j,r,i}^{(t+1)}=\widetilde{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot jy_{i}.

It is then easy to check by (3.1) that

\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\widetilde{\gamma}_{j,r}^{(t)}\cdot\|\bm{\mu}\|_{2}^{-2}\cdot\bm{\mu}+\sum_{i=1}^{n}\widetilde{\rho}_{j,r,i}^{(t)}\cdot\|\bm{\xi}_{i}\|_{2}^{-2}\cdot\bm{\xi}_{i}.

Hence by the uniqueness of the decomposition we have $\gamma_{j,r}^{(t)}=\widetilde{\gamma}_{j,r}^{(t)}$ and $\rho_{j,r,i}^{(t)}=\widetilde{\rho}_{j,r,i}^{(t)}$. Then we have that

\rho_{j,r,i}^{(t)}=-\sum_{s=0}^{t-1}\frac{\eta}{nm}\cdot\ell_{i}^{\prime(s)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot jy_{i}.

Moreover, note that i(t)<0\ell_{i}^{\prime(t)}<0 by the definition of the cross-entropy loss. Therefore,

ρ¯j,r,i(t)=s=0t1ηnmi(s)σ(𝐰j,r(s),𝝃i)𝝃i22𝟙(yi=j),\displaystyle\overline{\rho}_{j,r,i}^{(t)}=-\sum_{s=0}^{t-1}\frac{\eta}{nm}\cdot\ell_{i}^{\prime(s)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=j), (C.1)
ρ¯j,r,i(t)=s=0t1ηnmi(s)σ(𝐰j,r(s),𝝃i)𝝃i22𝟙(yi=j).\displaystyle\underline{\rho}_{j,r,i}^{(t)}=-\sum_{s=0}^{t-1}\frac{\eta}{nm}\cdot\ell_{i}^{\prime(s)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=-j). (C.2)

Writing out the iterative versions of (C.1) and (C.2) completes the proof. ∎
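For concreteness, the following self-contained numerical sketch (not part of the proof; the sizes n, m, d, q and the hyperparameters \sigma_{0}, \sigma_{p}, \eta are illustrative assumptions) runs gradient descent on the two-layer CNN and, in parallel, the coefficient recursions of Lemma C.1, and then verifies that the signal-noise decomposition (4.1) reconstructs the trained weights exactly. It tracks the combined coefficient \rho_{j,r,i}, which equals \overline{\rho}_{j,r,i} when y_{i}=j and \underline{\rho}_{j,r,i} when y_{i}=-j.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instantiation of the data model in Definition 3.1 (illustrative sizes,
# not the regime required by Condition 4.2).
n, m, d, q = 8, 5, 50, 3
sigma0, sigma_p, eta, steps = 0.01, 1.0, 0.05, 100

mu = np.zeros(d); mu[0] = 2.0                       # signal vector
y = rng.choice([-1, 1], size=n)                     # labels
xi = sigma_p * rng.standard_normal((n, d))
xi[:, 0] = 0.0                                      # noise orthogonal to mu

act  = lambda z: np.maximum(z, 0.0) ** q            # sigma(z) = max(z, 0)^q
dact = lambda z: q * np.maximum(z, 0.0) ** (q - 1)  # sigma'(z)

W0 = {j: sigma0 * rng.standard_normal((m, d)) for j in (+1, -1)}
W  = {j: W0[j].copy() for j in (+1, -1)}
gamma = {j: np.zeros(m) for j in (+1, -1)}          # gamma_{j,r}
rho   = {j: np.zeros((m, n)) for j in (+1, -1)}     # combined rho_{j,r,i}

def f(W, i):
    # f(W, x_i) = (1/m) sum_{j,r} j [sigma(<w_{j,r}, y_i mu>) + sigma(<w_{j,r}, xi_i>)]
    return sum(j * (act(W[j] @ (y[i] * mu)).sum() + act(W[j] @ xi[i]).sum())
               for j in (+1, -1)) / m

for t in range(steps):
    lp = np.array([-1.0 / (1.0 + np.exp(y[i] * f(W, i))) for i in range(n)])  # ell'^{(t)}_i < 0
    newW = {j: W[j].copy() for j in (+1, -1)}
    for j in (+1, -1):
        for r in range(m):
            g_mu = sum(lp[i] * dact(W[j][r] @ (y[i] * mu)) for i in range(n))
            gamma[j][r] -= eta / (n * m) * g_mu * (mu @ mu)          # gamma update in Lemma C.1
            newW[j][r]  -= eta / (n * m) * j * g_mu * mu             # signal part of the GD step
            for i in range(n):
                g_xi = lp[i] * dact(W[j][r] @ xi[i])
                rho[j][r, i] -= eta / (n * m) * g_xi * (xi[i] @ xi[i]) * j * y[i]
                newW[j][r]   -= eta / (n * m) * g_xi * j * y[i] * xi[i]  # noise part of the GD step
    W = newW

# Check (4.1): w = w^(0) + j*gamma*mu/||mu||_2^2 + sum_i rho_i * xi_i/||xi_i||_2^2.
xi_unit = xi / (xi ** 2).sum(axis=1, keepdims=True)
for j in (+1, -1):
    W_rec = W0[j] + j * np.outer(gamma[j], mu) / (mu @ mu) + rho[j] @ xi_unit
    print(j, np.abs(W_rec - W[j]).max())            # machine precision: the decomposition tracks GD exactly
```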

We can further plug the signal-noise decomposition (4.1) into the iterative formulas in Lemma C.1. By the second equation in Lemma C.1, we have

γj,r(t+1)\displaystyle\gamma_{j,r}^{(t+1)} =γj,r(t)ηnmi=1ni(t)σ(yi𝐰j,r(0),𝝁+yijγj,r(t))𝝁22,\displaystyle=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(y_{i}\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle+y_{i}\cdot j\cdot\gamma_{j,r}^{(t)})\cdot\|\bm{\mu}\|_{2}^{2}, (C.3)

Moreover, by the third equation in Lemma C.1, we have

ρ¯j,r,i(t+1)\displaystyle\overline{\rho}_{j,r,i}^{(t+1)} =ρ¯j,r,i(t)ηnmi(t)σ(𝐰j,r(0),𝝃i+i=1nρ¯j,r,i(t)𝝃i,𝝃i𝝃i22+i=1nρ¯j,r,i(t)𝝃i,𝝃i𝝃i22)𝝃i22\displaystyle=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2} (C.4)

if j=yij=y_{i}, and ρ¯j,r,i(t)=0\overline{\rho}_{j,r,i}^{(t)}=0 for all t0t\geq 0 if j=yij=-y_{i}. Similarly, by the last equation in Lemma C.1, we have

ρ¯j,r,i(t+1)\displaystyle\underline{\rho}_{j,r,i}^{(t+1)} =ρ¯j,r,i(t)+ηnmi(t)σ(𝐰j,r(0),𝝃i+i=1nρ¯j,r,i(t)𝝃i,𝝃i𝝃i22+i=1nρ¯j,r,i(t)𝝃i,𝝃i𝝃i22)𝝃i22\displaystyle=\underline{\rho}_{j,r,i}^{(t)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2} (C.5)

if j=yij=-y_{i}, and ρ¯j,r,i(t)=0\underline{\rho}_{j,r,i}^{(t)}=0 for all t0t\geq 0 if j=yij=y_{i}.

We will now show that the coefficients in the signal-noise decomposition remain on a reasonable scale throughout a long period of training. Let us consider the learning period 0\leq t\leq T^{*}, where T^{*}=\eta^{-1}\text{poly}(\epsilon^{-1},\|\bm{\mu}\|_{2}^{-1},d^{-1}\sigma_{p}^{-2},\sigma_{0}^{-1},n,m,d) is the maximum number of admissible iterations. Note that we can consider any polynomial training time T^{*}. Denote \alpha=4\log(T^{*}). Here we list the exact conditions on \eta,\sigma_{0},d required by the proofs in this section, which are part of Condition 4.2:

η=O(min{nm/(qσp2d),nm/(q2q+2αq2σp2d),nm/(q2q+2αq2𝝁22)}),\displaystyle\eta=O\Big{(}\min\{nm/(q\sigma_{p}^{2}d),nm/(q2^{q+2}\alpha^{q-2}\sigma_{p}^{2}d),nm/(q2^{q+2}\alpha^{q-2}\|\bm{\mu}\|_{2}^{2})\}\Big{)}, (C.6)
σ0[16log(8mn/δ)]1min{𝝁21,(σpd)1},\displaystyle\sigma_{0}\leq[16\sqrt{\log(8mn/\delta)}]^{-1}\min\{\|\bm{\mu}\|_{2}^{-1},(\sigma_{p}\sqrt{d})^{-1}\}, (C.7)
d1024log(4n2/δ)α2n2.\displaystyle d\geq 1024\log(4n^{2}/\delta)\alpha^{2}n^{2}. (C.8)

Denote β=2maxi,j,r{|𝐰j,r(0),𝝁|,|𝐰j,r(0),𝝃i|}\beta=2\max_{i,j,r}\{|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\}. By Lemma B.3, with probability at least 1δ1-\delta, we can upper bound β\beta by 4log(8mn/δ)σ0max{𝝁2,σpd}4\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\cdot\max\{\|\bm{\mu}\|_{2},\sigma_{p}\sqrt{d}\}. Then, by (C.7) and (C.8), it is straightforward to verify the following inequality:

4max{β,8nlog(4n2/δ)dα}1.\displaystyle 4\max\bigg{\{}\beta,8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{\}}\leq 1. (C.9)
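As a quick numerical sanity check of this implication (the values of \delta, m, n, T^{*}, \|\bm{\mu}\|_{2} and \sigma_{p} below are illustrative assumptions; d and \sigma_{0} are chosen to satisfy (C.8) and (C.7) with some slack), one can plug in numbers and confirm (C.9):

```python
import numpy as np

# Illustrative values; delta, m, n, T*, ||mu||_2, sigma_p are assumptions for this check.
delta, m, n = 0.01, 10, 20
T_star = 10 ** 6
alpha = 4 * np.log(T_star)                                         # alpha = 4 log(T*)

d = 2 * 1024 * np.log(4 * n ** 2 / delta) * alpha ** 2 * n ** 2    # twice the bound in (C.8)
mu_norm, sigma_p = 5.0, 1.0
sigma0 = 0.5 * min(1 / mu_norm, 1 / (sigma_p * np.sqrt(d))) \
         / (16 * np.sqrt(np.log(8 * m * n / delta)))               # half the bound in (C.7)

# Upper bound on beta from Lemma B.3, then the left-hand side of (C.9).
beta = 4 * np.sqrt(np.log(8 * m * n / delta)) * sigma0 * max(mu_norm, sigma_p * np.sqrt(d))
lhs = 4 * max(beta, 8 * n * np.sqrt(np.log(4 * n ** 2 / delta) / d) * alpha)
print(lhs)        # about 0.7 here, so (C.9) holds
assert lhs <= 1
```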

Suppose that the conditions listed in (C.6), (C.7) and (C.8) hold. We claim that the following property holds for all 0\leq t\leq T^{*}.

Proposition C.2 (Restatement of Proposition 5.3).

Under Condition 4.2, for 0tT0\leq t\leq T^{*}, we have that

0γj,r(t),ρ¯j,r,i(t)α,\displaystyle 0\leq\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}\leq\alpha, (C.10)
0ρ¯j,r,i(t)β16nlog(4n2/δ)dαα.\displaystyle 0\geq\underline{\rho}_{j,r,i}^{(t)}\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\geq-\alpha. (C.11)

for all r[m]r\in[m], j{±1}j\in\{\pm 1\} and i[n]i\in[n].

We will use induction to prove Proposition C.2. We first introduce several technical lemmas that will be used for the proof of Proposition C.2.

Lemma C.3.

For any t0t\geq 0, it holds that 𝐰j,r(t)𝐰j,r(0),𝛍=jγj,r(t)\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle=j\cdot\gamma_{j,r}^{(t)} for all r[m]r\in[m], j{±1}j\in\{\pm 1\}.

Proof of Lemma C.3.

For any time t0t\geq 0, we have that

𝐰j,r(t)𝐰j,r(0),𝝁\displaystyle\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle =jγj,r(t)+i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝁+i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝁\displaystyle=j\cdot\gamma_{j,r}^{(t)}+\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\mu}\rangle+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\mu}\rangle
=jγj,r(t),\displaystyle=j\cdot\gamma_{j,r}^{(t)},

where the second equality holds because the noise vectors \bm{\xi}_{i^{\prime}} are orthogonal to \bm{\mu} under our data model. ∎

Lemma C.4.

Under Condition 4.2, suppose (C.10) and (C.11) hold at iteration tt. Then

ρ¯j,r,i(t)8nlog(4n2/δ)dα𝐰j,r(t)𝐰j,r(0),𝝃i\displaystyle\underline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle ρ¯j,r,i(t)+8nlog(4n2/δ)dα,jyi,\displaystyle\leq\underline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,~{}j\not=y_{i},
ρ¯j,r,i(t)8nlog(4n2/δ)dα𝐰j,r(t)𝐰j,r(0),𝝃i\displaystyle\overline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle ρ¯j,r,i(t)+8nlog(4n2/δ)dα,j=yi\displaystyle\leq\overline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,~{}j=y_{i}

for all r[m]r\in[m], j{±1}j\in\{\pm 1\} and i[n]i\in[n].

Proof of Lemma C.4.

For jyij\not=y_{i}, we have that ρ¯j,r,i(t)=0\overline{\rho}_{j,r,i}^{(t)}=0 and

𝐰j,r(t)𝐰j,r(0),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle =i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝃i+i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝃i\displaystyle=\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle
4log(4n2/δ)dii|ρ¯j,r,i(t)|+4log(4n2/δ)dii|ρ¯j,r,i(t)|+ρ¯j,r,i(t)\displaystyle\leq 4\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\sum_{i^{\prime}\not=i}|\overline{\rho}_{j,r,i^{\prime}}^{(t)}|+4\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\sum_{i^{\prime}\not=i}|\underline{\rho}_{j,r,i^{\prime}}^{(t)}|+\underline{\rho}_{j,r,i}^{(t)}
ρ¯j,r,i(t)+8nlog(4n2/δ)dα,\displaystyle\leq\underline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where the second inequality is by Lemma B.2 and the last inequality is by |\overline{\rho}^{(t)}_{j,r,i^{\prime}}|,|\underline{\rho}_{j,r,i^{\prime}}^{(t)}|\leq\alpha in (C.10). Similarly, for y_{i}=j, we have that \underline{\rho}_{j,r,i}^{(t)}=0 and

𝐰j,r(t)𝐰j,r(0),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle =i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝃i+i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝃i\displaystyle=\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle
ρ¯j,r,i(t)+4log(4n2/δ)dii|ρ¯j,r,i(t)|+4log(4n2/δ)dii|ρ¯j,r,i(t)|\displaystyle\leq\overline{\rho}_{j,r,i}^{(t)}+4\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\sum_{i^{\prime}\not=i}|\overline{\rho}_{j,r,i^{\prime}}^{(t)}|+4\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\sum_{i^{\prime}\not=i}|\underline{\rho}_{j,r,i^{\prime}}^{(t)}|
ρ¯j,r,i(t)+8nlog(4n2/δ)dα,\displaystyle\leq\overline{\rho}^{(t)}_{j,r,i}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where the first inequality is by Lemma B.2 and the second inequality is by |\overline{\rho}^{(t)}_{j,r,i^{\prime}}|,|\underline{\rho}_{j,r,i^{\prime}}^{(t)}|\leq\alpha in (C.10). Similarly, we can show that \langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle\geq\underline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\log(4n^{2}/\delta)/d}\cdot\alpha and \langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle\geq\overline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\log(4n^{2}/\delta)/d}\cdot\alpha, which completes the proof. ∎

Lemma C.5.

Under Condition 4.2, suppose (C.10) and (C.11) hold at iteration tt. Then

𝐰j,r(t),yi𝝁\displaystyle\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle 𝐰j,r(0),yi𝝁,\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle,
𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+8nlog(4n2/δ)dα,\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,
Fj(𝐖j(t),𝐱i)\displaystyle F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}_{i}) 1\displaystyle\leq 1

for all r[m]r\in[m] and jyij\not=y_{i}.

Proof of Lemma C.5.

For jyij\not=y_{i}, we have that

𝐰j,r(t),yi𝝁=𝐰j,r(0),yi𝝁+yijγj,r(t)𝐰j,r(0),yi𝝁,\displaystyle\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle=\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle+y_{i}\cdot j\cdot\gamma_{j,r}^{(t)}\leq\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle, (C.12)

where the inequality is by γj,r(t)0\gamma_{j,r}^{(t)}\geq 0. In addition, we have

𝐰j,r(t),𝝃i𝐰j,r(0),𝝃i+ρ¯j,r,i(t)+8nlog(4n2/δ)dα𝐰j,r(0),𝝃i+8nlog(4n2/δ)dα,\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\underline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha, (C.13)

where the first inequality is by Lemma C.4 and the second inequality is due to ρ¯j,r,i(t)0\underline{\rho}_{j,r,i}^{(t)}\leq 0. Then we can get that

Fj(𝐖j(t),𝐱i)\displaystyle F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}_{i}) =1mr=1m[σ(𝐰j,r(t),j𝝁)+σ(𝐰j,r(t),𝝃i)]\displaystyle=\frac{1}{m}\sum_{r=1}^{m}[\sigma(\langle\mathbf{w}_{j,r}^{(t)},-j\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)]
2q+1maxj,r,i{|𝐰j,r(0),𝝁|,|𝐰j,r(0),𝝃i|,8nlog(4n2/δ)dα}q\displaystyle\leq 2^{q+1}\max_{j,r,i}\bigg{\{}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|,8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{\}}^{q}
1,\displaystyle\leq 1,

where the first inequality is by (C.12), (C.13) and the second inequality is by (C.9). ∎

Lemma C.6.

Under Condition 4.2, suppose (C.10) and (C.11) hold at iteration tt. Then

𝐰j,r(t),yi𝝁\displaystyle\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle =𝐰j,r(0),yi𝝁+γj,r(t),\displaystyle=\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle+\gamma_{j,r}^{(t)},
𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+ρ¯j,r,i(t)+8nlog(4n2/δ)dα\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha

for all r[m]r\in[m], j{±1}j\in\{\pm 1\} and i[n]i\in[n]. If max{γj,r(t),ρ¯j,r,i(t)}=O(1)\max\{\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}\}=O(1), we further have that Fj(𝐖j(t),𝐱i)=O(1)F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}_{i})=O(1).

Proof of Lemma C.6.

For j=yij=y_{i}, we have that

𝐰j,r(t),yi𝝁=𝐰j,r(0),yi𝝁+γj,r(t),\displaystyle\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle=\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle+\gamma_{j,r}^{(t)}, (C.14)

where the equation is by Lemma C.3. We also have that

𝐰j,r(t),𝝃i𝐰j,r(0),𝝃i+ρ¯j,r,i(t)+8nlog(4n2/δ)dα,\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha, (C.15)

where the inequality is by Lemma C.4. If \max\{\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}\}=O(1), we have the following bound:

\displaystyle F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}_{i}) =\frac{1}{m}\sum_{r=1}^{m}[\sigma(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)]
23qmaxj,r,i{γj,r(t),ρ¯j,r,i(t),|𝐰j,r(0),𝝁|,|𝐰j,r(0),𝝃i|,8nlog(4n2/δ)dα}q\displaystyle\leq 2\cdot 3^{q}\max_{j,r,i}\bigg{\{}\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|,8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{\}}^{q}
=O(1),\displaystyle=O(1),

where the first inequality is by (C.14), (C.15) and the second inequality is by (C.9) where β=2maxi,j,r{|𝐰j,r(0),𝝁|,|𝐰j,r(0),𝝃i|}\beta=2\max_{i,j,r}\{|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\}. ∎

Now we are ready to prove Proposition C.2.

Proof of Proposition C.2.

Our proof is based on induction. The results are obvious at t=0t=0 as all the coefficients are zero. Suppose that there exists T~T\widetilde{T}\leq T^{*} such that the results in Proposition C.2 hold for all time 0tT~10\leq t\leq\widetilde{T}-1. We aim to prove that they also hold for t=T~t=\widetilde{T}.

We first prove that (C.11) holds for t=T~t=\widetilde{T}, i.e., ρ¯j,r,i(t)β16nlog(4n2/δ)dα\underline{\rho}^{(t)}_{j,r,i}\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha for t=T~t=\widetilde{T}, r[m]r\in[m], j{±1}j\in\{\pm 1\} and i[n]i\in[n]. Notice that ρ¯j,r,i(t)=0,j=yi\underline{\rho}_{j,r,i}^{(t)}=0,\forall j=y_{i}. Therefore, we only need to consider the case that jyij\not=y_{i}. When ρ¯j,r,i(T~1)0.5β8nlog(4n2/δ)dα\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}\leq-0.5\beta-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha, by Lemma C.4 we have that

𝐰j,r(T~1),𝝃iρ¯j,r,i(T~1)+𝐰j,r(0),𝝃i+8nlog(4n2/δ)dα0,\displaystyle\langle\mathbf{w}_{j,r}^{(\widetilde{T}-1)},\bm{\xi}_{i}\rangle\leq\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}+\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 0,

and thus

ρ¯j,r,i(T~)\displaystyle\underline{\rho}_{j,r,i}^{(\widetilde{T})} =ρ¯j,r,i(T~1)+ηnmi(T~1)σ(𝐰j,r(T~1),𝝃i)𝟙(yi=j)𝝃i22\displaystyle=\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(\widetilde{T}-1)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(\widetilde{T}-1)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=-j)\|\bm{\xi}_{i}\|_{2}^{2}
=ρ¯j,r,i(T~1)\displaystyle=\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}
β16nlog(4n2/δ)dα,\displaystyle\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where the last inequality is by induction hypothesis. When ρ¯j,r,i(T~1)0.5β8nlog(4n2/δ)dα\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}\geq-0.5\beta-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha, we have that

\displaystyle\underline{\rho}_{j,r,i}^{(\widetilde{T})} =\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(\widetilde{T}-1)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(\widetilde{T}-1)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=-j)\|\bm{\xi}_{i}\|_{2}^{2}
0.5β8nlog(4n2/δ)dαO(ησp2dnm)σ(0.5β+8nlog(4n2/δ)dα)\displaystyle\geq-0.5\beta-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha-O\bigg{(}\frac{\eta\sigma_{p}^{2}d}{nm}\bigg{)}\sigma^{\prime}\bigg{(}0.5\beta+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{)}
0.5β8nlog(4n2/δ)dαO(ηqσp2dnm)(0.5β+8nlog(4n2/δ)dα)\displaystyle\geq-0.5\beta-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha-O\bigg{(}\frac{\eta q\sigma_{p}^{2}d}{nm}\bigg{)}\bigg{(}0.5\beta+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{)}
β16nlog(4n2/δ)dα,\displaystyle\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where we use |\ell_{i}^{\prime(\widetilde{T}-1)}|\leq 1 and \|\bm{\xi}_{i}\|_{2}^{2}=O(\sigma_{p}^{2}d) in the first inequality, the second inequality is by 0.5\beta+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 1, and the last inequality is by \eta=O\big{(}nm/(q\sigma_{p}^{2}d)\big{)} in (C.6).

Next we prove (C.10) holds for t=T~t=\widetilde{T}. We have

|i(t)|\displaystyle|\ell_{i}^{\prime(t)}| =11+exp{yi[F+1(𝐖+1(t),𝐱i)F1(𝐖1(t),𝐱i)]}\displaystyle=\frac{1}{1+\exp\{y_{i}\cdot[F_{+1}(\mathbf{W}_{+1}^{(t)},\mathbf{x}_{i})-F_{-1}(\mathbf{W}_{-1}^{(t)},\mathbf{x}_{i})]\}}
exp{yi[F+1(𝐖+1(t),𝐱i)F1(𝐖1(t),𝐱i)]}\displaystyle\leq\exp\{-y_{i}\cdot[F_{+1}(\mathbf{W}_{+1}^{(t)},\mathbf{x}_{i})-F_{-1}(\mathbf{W}_{-1}^{(t)},\mathbf{x}_{i})]\}
exp{Fyi(𝐖yi(t),𝐱i)+1}.\displaystyle\leq\exp\{-F_{y_{i}}(\mathbf{W}_{y_{i}}^{(t)},\mathbf{x}_{i})+1\}. (C.16)

where the last inequality is due to Lemma C.5. Moreover, recall the update rule of γj,r(t)\gamma_{j,r}^{(t)} and ρ¯j,r,i(t)\overline{\rho}_{j,r,i}^{(t)},

γj,r(t+1)\displaystyle\gamma_{j,r}^{(t+1)} =γj,r(t)ηnmi=1ni(t)σ(𝐰j,r(t),yi𝝁)𝝁22,\displaystyle=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\|\bm{\mu}\|_{2}^{2},
ρ¯j,r,i(t+1)\displaystyle\overline{\rho}_{j,r,i}^{(t+1)} =ρ¯j,r,i(t)ηnmi(t)σ(𝐰j,r(t),𝝃i)𝟙(yi=j)𝝃i22.\displaystyle=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}.

Let t_{j,r,i} be the last time t<T^{*} such that \overline{\rho}_{j,r,i}^{(t)}\leq 0.5\alpha. Then we have that

\displaystyle\overline{\rho}_{j,r,i}^{(\widetilde{T})} =\overline{\rho}_{j,r,i}^{(t_{j,r,i})}-\underbrace{\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t_{j,r,i})}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t_{j,r,i})},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}}_{I_{1}}
\displaystyle\qquad-\underbrace{\sum_{t_{j,r,i}<t<\widetilde{T}}\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}}_{I_{2}}. (C.17)

We first bound I1I_{1} as follows,

|I1|2qn1m1η(ρ¯j,r,i(tj,r,i)+0.5β+8nlog(4n2/δ)dα)q1σp2dq2qn1m1ηαq1σp2d0.25α,\displaystyle|I_{1}|\leq 2qn^{-1}m^{-1}\eta\bigg{(}\overline{\rho}_{j,r,i}^{(t_{j,r,i})}+0.5\beta+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{)}^{q-1}\sigma_{p}^{2}d\leq q2^{q}n^{-1}m^{-1}\eta\alpha^{q-1}\sigma_{p}^{2}d\leq 0.25\alpha,

where the first inequality is by Lemmas C.4 and B.2, the second inequality is by β0.1α\beta\leq 0.1\alpha and 8nlog(4n2/δ)dα0.1α8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 0.1\alpha, the last inequality is by ηnm/(q2q+2αq2σp2d)\eta\leq nm/(q2^{q+2}\alpha^{q-2}\sigma_{p}^{2}d).

Second, we bound I2I_{2}. For tj,r,i<t<T~t_{j,r,i}<t<\widetilde{T} and yi=jy_{i}=j, we can lower bound 𝐰j,r(t),𝝃i\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle as follows,

𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+ρ¯j,r,i(t)8nlog(4n2/δ)dα\displaystyle\geq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
0.5β+0.5α8nlog(4n2/δ)dα\displaystyle\geq-0.5\beta+0.5\alpha-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
0.25α,\displaystyle\geq 0.25\alpha,

where the first inequality is by Lemma C.4, the second inequality is by ρ¯j,r,i(t)>0.5α\overline{\rho}_{j,r,i}^{(t)}>0.5\alpha and 𝐰j,r(0),𝝃i0.5β\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle\geq-0.5\beta due to the definition of tj,r,it_{j,r,i} and β\beta, the last inequality is by β0.1α\beta\leq 0.1\alpha and 8nlog(4n2/δ)dα0.1α8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 0.1\alpha. Similarly, for tj,r,i<t<T~t_{j,r,i}<t<\widetilde{T} and yi=jy_{i}=j, we can also upper bound 𝐰j,r(t),𝝃i\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle as follows,

𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+ρ¯j,r,i(t)+8nlog(4n2/δ)dα\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
0.5β+α+8nlog(4n2/δ)dα\displaystyle\leq 0.5\beta+\alpha+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
2α,\displaystyle\leq 2\alpha,

where the first inequality is by Lemma C.4, the second inequality is by induction hypothesis ρ¯j,r,i(t)α\overline{\rho}_{j,r,i}^{(t)}\leq\alpha, the last inequality is by β0.1α\beta\leq 0.1\alpha and 8nlog(4n2/δ)dα0.1α8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 0.1\alpha. Thus, plugging the upper and lower bounds of 𝐰j,r(t),𝝃i\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle into I2I_{2} gives

|I2|\displaystyle|I_{2}| tj,r,i<t<T~ηnmexp(σ(𝐰j,r(t),𝝃i)+1)σ(𝐰j,r(t),𝝃i)𝟙(yi=j)𝝃i22\displaystyle\leq\sum_{t_{j,r,i}<t<\widetilde{T}}\frac{\eta}{nm}\cdot\exp(-\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)+1)\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}
eq2qηTnexp(αq/4q)αq1σp2d\displaystyle\leq\frac{eq2^{q}\eta T^{*}}{n}\exp(-\alpha^{q}/4^{q})\alpha^{q-1}\sigma_{p}^{2}d
0.25Texp(αq/4q)α\displaystyle\leq 0.25T^{*}\exp(-\alpha^{q}/4^{q})\alpha
0.25Texp(log(T)q)α\displaystyle\leq 0.25T^{*}\exp(-\log(T^{*})^{q})\alpha
0.25α,\displaystyle\leq 0.25\alpha,

where the first inequality is by (C.16), the second inequality is by Lemma B.2, the third inequality is by η=O(nm/(q2q+2αq2σp2d))\eta=O\big{(}nm/(q2^{q+2}\alpha^{q-2}\sigma_{p}^{2}d)\big{)} in (C.6), the fourth inequality is by our choice of α=4log(T)\alpha=4\log(T^{*}) and the last inequality is due to the fact that log(T)qlog(T)\log(T^{*})^{q}\geq\log(T^{*}). Plugging the bound of I1,I2I_{1},I_{2} into (C.17) completes the proof for ρ¯\overline{\rho}. Similarly, we can prove that γj,r(T~)α\gamma_{j,r}^{(\widetilde{T})}\leq\alpha using η=O(nm/(q2q+2αq2𝝁22))\eta=O\big{(}nm/(q2^{q+2}\alpha^{q-2}\|\bm{\mu}\|_{2}^{2})\big{)} in (C.6). Therefore Proposition C.2 holds for t=T~t=\widetilde{T}, which completes the induction. ∎

Based on Proposition C.2, we introduce some important properties of the training loss function for 0tT0\leq t\leq T^{*}.

Lemma C.7 (Restatement of Lemma 5.4).

Under Condition 4.2, for 0tT0\leq t\leq T^{*}, the following result holds.

LS(𝐖(t))F2O(max{𝝁22,σp2d})LS(𝐖(t)).\displaystyle\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}\leq O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\})L_{S}(\mathbf{W}^{(t)}).
Proof of Lemma C.7.

We first prove that

\displaystyle-\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\cdot\|\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i})\|_{F}^{2}=O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\}). (C.18)

Without loss of generality, we suppose that yi=1y_{i}=1 and 𝐱i=[𝝁,𝝃i]\mathbf{x}_{i}=[\bm{\mu}^{\top},\bm{\xi}_{i}]. Then we have that

f(𝐖(t),𝐱i)F\displaystyle\|\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i})\|_{F} 1mj,r[σ(𝐰j,r(t),𝝁)𝝁+σ(𝐰j,r(t),𝝃i)𝝃i]2\displaystyle\leq\frac{1}{m}\sum_{j,r}\bigg{\|}\big{[}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle)\bm{\mu}+\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\bm{\xi}_{i}\big{]}\bigg{\|}_{2}
1mj,rσ(𝐰j,r(t),𝝁)𝝁2+1mj,rσ(𝐰j,r(t),𝝃i)𝝃i2\displaystyle\leq\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle)\|\bm{\mu}\|_{2}+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\|\bm{\xi}_{i}\|_{2}
2q[F+1(𝐖+1(t),𝐱i)](q1)/qmax{𝝁2,2σpd}\displaystyle\leq 2q\bigg{[}F_{+1}(\mathbf{W}^{(t)}_{+1},\mathbf{x}_{i})\bigg{]}^{(q-1)/q}\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}
+2q[F1(𝐖1(t),𝐱i)](q1)/qmax{𝝁2,2σpd}\displaystyle\qquad+2q\bigg{[}F_{-1}(\mathbf{W}^{(t)}_{-1},\mathbf{x}_{i})\bigg{]}^{(q-1)/q}\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}
2q[F+1(𝐖+1(t),𝐱i)](q1)/qmax{𝝁2,2σpd}+2qmax{𝝁2,2σpd},\displaystyle\leq 2q\bigg{[}F_{+1}(\mathbf{W}^{(t)}_{+1},\mathbf{x}_{i})\bigg{]}^{(q-1)/q}\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}+2q\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\},

where the first and second inequalities are by the triangle inequality, the third inequality is by Jensen’s inequality and Lemma B.2, and the last inequality is due to Lemma C.5. Denote A=F_{+1}(\mathbf{W}_{+1}^{(t)},\mathbf{x}_{i}). Then A\geq 0, and moreover F_{-1}(\mathbf{W}^{(t)}_{-1},\mathbf{x}_{i})\leq 1 by Lemma C.5. Therefore, we have that

(yif(𝐖(t),𝐱i))f(𝐖(t),𝐱i)F2\displaystyle-\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\cdot\|\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i})\|_{F}^{2}
(A1)(2qA(q1)/qmax{𝝁2,2σpd}+2qmax{𝝁2,2σpd})2\displaystyle\qquad\leq-\ell^{\prime}(A-1)\bigg{(}2q\cdot A^{(q-1)/q}\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}+2q\cdot\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}\bigg{)}^{2}
=4q2(A1)(A(q1)/q+1)2max{𝝁22,4σp2d}\displaystyle\qquad=-4q^{2}\ell^{\prime}(A-1)(A^{(q-1)/q}+1)^{2}\cdot\max\{\|\bm{\mu}\|_{2}^{2},4\sigma_{p}^{2}d\}
(maxz>04q2(z1)(z(q1)/q+1)2)max{𝝁22,4σp2d}\displaystyle\qquad\leq\big{(}\max_{z>0}-4q^{2}\ell^{\prime}(z-1)(z^{(q-1)/q}+1)^{2}\big{)}\max\{\|\bm{\mu}\|_{2}^{2},4\sigma_{p}^{2}d\}
=(i)O(max{𝝁22,σp2d}),\displaystyle\qquad\overset{(i)}{=}O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\}),

where (i) is by maxz04q2(z1)(z(q1)/q+1)2<\max_{z\geq 0}-4q^{2}\ell^{\prime}(z-1)(z^{(q-1)/q}+1)^{2}<\infty because \ell^{\prime} has an exponentially decaying tail. Now we can upper bound the gradient norm LS(𝐖(t))F\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F} as follows,

LS(𝐖(t))F2\displaystyle\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2} [1ni=1n(yif(𝐖(t),𝐱i))f(𝐖(t),𝐱i)F]2\displaystyle\leq\bigg{[}\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\|\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i})\|_{F}\bigg{]}^{2}
[1ni=1nO(max{𝝁22,σp2d})(yif(𝐖(t),𝐱i))]2\displaystyle\leq\bigg{[}\frac{1}{n}\sum_{i=1}^{n}\sqrt{-O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\})\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}}\bigg{]}^{2}
O(max{𝝁22,σp2d})1ni=1n(yif(𝐖(t),𝐱i))\displaystyle\leq O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\})\cdot\frac{1}{n}\sum_{i=1}^{n}-\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}
O(max{𝝁22,σp2d})LS(𝐖(t)),\displaystyle\leq O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\})L_{S}(\mathbf{W}^{(t)}),

where the first inequality is by the triangle inequality, the second inequality is by (C.18), the third inequality is by the Cauchy-Schwarz inequality, and the last inequality is due to the property of the cross-entropy loss that -\ell^{\prime}\leq\ell. ∎
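Two analytic facts about the cross-entropy loss are used above: -\ell^{\prime}(z)\leq\ell(z) for all z, and the boundedness of the map in step (i), which relies on the exponentially decaying tail of \ell^{\prime}. The following short sketch (with an illustrative exponent q) checks both numerically.

```python
import numpy as np

z = np.linspace(-20.0, 80.0, 200001)
ell  = np.log1p(np.exp(-z))              # cross-entropy loss ell(z) = log(1 + e^{-z})
dell = -1.0 / (1.0 + np.exp(z))          # ell'(z), always negative

# Property used in the last step of Lemma C.7: -ell'(z) <= ell(z) for all z.
assert np.all(-dell <= ell + 1e-12)

# Step (i): z |-> -4 q^2 ell'(z-1) (z^{(q-1)/q} + 1)^2 is bounded over z > 0,
# since the exponential tail of ell' dominates the polynomial factor.
q = 3                                     # illustrative exponent, q > 2
zp = np.linspace(1e-6, 80.0, 200001)
g = -4 * q ** 2 * (-1.0 / (1.0 + np.exp(zp - 1))) * (zp ** ((q - 1) / q) + 1) ** 2
print(g.max())                            # a finite O(1) constant
```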

Appendix D Signal Learning

In this section, we consider the signal learning case under the condition that n𝝁2qΩ~(σpq(d)q)n\|\bm{\mu}\|_{2}^{q}\geq\widetilde{\Omega}(\sigma_{p}^{q}(\sqrt{d})^{q}). We remind the readers that the proofs in this section are based on the results in Section B, which hold with high probability.

D.1 First stage

Lemma D.1 (Restatement of Lemma 5.5).

Under the same conditions as Theorem 4.3, in particular if we choose

nSNRqClog(6/σ0𝝁2)22q+6[4log(8mn/δ)](q1)/2,\displaystyle n\cdot\mathrm{SNR}^{q}\geq C\log(6/\sigma_{0}\|\bm{\mu}\|_{2})2^{2q+6}[4\log(8mn/\delta)]^{(q-1)/2}, (D.1)

where C=O(1)C=O(1) is a positive constant, there exists time

T1=Clog(6/σ0𝝁2)2q+1mησ0q2𝝁2q\displaystyle T_{1}=\frac{C\log(6/\sigma_{0}\|\bm{\mu}\|_{2})2^{q+1}m}{\eta\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}

such that

  • maxrγj,r(T1)2\max_{r}\gamma_{j,r}^{(T_{1})}\geq 2 for j{±1}j\in\{\pm 1\}.

  • |ρj,r,i(t)|σ0σpd/2|\rho_{j,r,i}^{(t)}|\leq\sigma_{0}\sigma_{p}\sqrt{d}/2 for all j{±1},r[m]j\in\{\pm 1\},r\in[m], i[n]i\in[n] and 0tT10\leq t\leq T_{1}.

Proof of Lemma D.1.

Let

T1+=nmη1σ02qσpqdq/22q+4q[4log(8mn/δ)](q1)/2.\displaystyle T_{1}^{+}=\frac{nm\eta^{-1}\sigma_{0}^{2-q}\sigma_{p}^{-q}d^{-q/2}}{2^{q+4}q[4\log(8mn/\delta)]^{(q-1)/2}}. (D.2)

We first prove the second bullet. Define Ψ(t)=maxj,r,i|ρj,r,i(t)|=maxj,r,i{ρ¯j,r,i(t),ρ¯j,r,i(t)}\Psi^{(t)}=\max_{j,r,i}|\rho_{j,r,i}^{(t)}|=\max_{j,r,i}\{\overline{\rho}_{j,r,i}^{(t)},-\underline{\rho}_{j,r,i}^{(t)}\}. We use induction to show that

Ψ(t)σ0σpd/2\displaystyle\Psi^{(t)}\leq\sigma_{0}\sigma_{p}\sqrt{d}/2 (D.3)

for all 0tT1+0\leq t\leq T_{1}^{+}. By definition, clearly we have Ψ(0)=0\Psi^{(0)}=0. Now suppose that there exists some T~T1+\widetilde{T}\leq T_{1}^{+} such that (D.3) holds for 0<tT~10<t\leq\widetilde{T}-1. Then by (C.4) and (C.5) we have

Ψ(t+1)\displaystyle\Psi^{(t+1)} Ψ(t)+maxj,r,i{ηnm|i(t)|σ(𝐰j,r(0),𝝃i+i=1nΨ(t)|𝝃i,𝝃i|𝝃i22+i=1nΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle\leq\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot|\ell_{i}^{\prime(t)}|\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}+\sum_{i^{\prime}=1}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2}\bigg{\}}
Ψ(t)+maxj,r,i{ηnmσ(𝐰j,r(0),𝝃i+2i=1nΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle\leq\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+2\cdot\sum_{i^{\prime}=1}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2}\bigg{\}}
=Ψ(t)+maxj,r,i{ηnmσ(𝐰j,r(0),𝝃i+2Ψ(t)+2iinΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle=\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+2\Psi^{(t)}+2\cdot\sum_{i^{\prime}\neq i}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2}\bigg{\}}
Ψ(t)+ηqnm[2log(8mn/δ)σ0σpd+(2+4nσp2dlog(4n2/δ)σp2d/2)Ψ(t)]q12σp2d\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\Bigg{[}2\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}+\Bigg{(}2+\frac{4n\sigma_{p}^{2}\cdot\sqrt{d\log(4n^{2}/\delta)}}{\sigma_{p}^{2}d/2}\Bigg{)}\cdot\Psi^{(t)}\Bigg{]}^{q-1}\cdot 2\sigma_{p}^{2}d
Ψ(t)+ηqnm(2log(8mn/δ)σ0σpd+4Ψ(t))q12σp2d\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\big{(}2\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}+4\Psi^{(t)}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d
Ψ(t)+ηqnm(4log(8mn/δ)σ0σpd)q12σp2d,\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\big{(}4\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d,

where the second inequality is by |i(t)|1|\ell_{i}^{\prime(t)}|\leq 1, the third inequality is due to Lemmas B.2 and B.3, the fourth inequality follows by the condition that d16n2log(4n2/δ)d\geq 16n^{2}\log(4n^{2}/\delta), and the last inequality follows by the induction hypothesis (D.3). Taking a telescoping sum over t=0,1,,T~1t=0,1,\ldots,\widetilde{T}-1 then gives

Ψ(T~)\displaystyle\Psi^{(\widetilde{T})} T~ηqnm(4log(8mn/δ)σ0σpd)q12σp2d\displaystyle\leq\widetilde{T}\cdot\frac{\eta q}{nm}\cdot\big{(}4\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d
T1+ηqnm(4log(8mn/δ)σ0σpd)q12σp2d\displaystyle\leq T_{1}^{+}\cdot\frac{\eta q}{nm}\cdot\big{(}4\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d
σ0σpd2,\displaystyle\leq\frac{\sigma_{0}\sigma_{p}\sqrt{d}}{2},

where the second inequality follows by T~T1+\widetilde{T}\leq T_{1}^{+} in our induction hypothesis. Therefore, by induction, we have Ψ(t)σ0σpd/2\Psi^{(t)}\leq\sigma_{0}\sigma_{p}\sqrt{d}/2 for all tT1+t\leq T_{1}^{+}.

Now, without loss of generality, let us consider j=1 first. Denote by T_{1,1} the last time t in the period [0,T_{1}^{+}] such that \max_{r}\gamma_{1,r}^{(t)}\leq 2. Then for t\leq T_{1,1}, we have \max_{j,r,i}\{|\rho_{j,r,i}^{(t)}|\}=O(\sigma_{0}\sigma_{p}\sqrt{d})=O(1) and \max_{r}\gamma_{1,r}^{(t)}\leq 2. Therefore, by Lemmas C.5 and C.6, we know that F_{-1}(\mathbf{W}_{-1}^{(t)},\mathbf{x}_{i}),F_{+1}(\mathbf{W}_{+1}^{(t)},\mathbf{x}_{i})=O(1) for all i with y_{i}=1. Thus there exists a positive constant C_{1} such that -\ell^{\prime(t)}_{i}\geq C_{1} for all i with y_{i}=1.

By (C.3), for tT1,1t\leq T_{1,1} we have

γ1,r(t+1)\displaystyle\gamma_{1,r}^{(t+1)} =γ1,r(t)ηnmi=1ni(t)σ(yi𝐰1,r(0),𝝁+yiγ1,r(t))𝝁22\displaystyle=\gamma_{1,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(y_{i}\cdot\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle+y_{i}\cdot\gamma_{1,r}^{(t)})\cdot\|\bm{\mu}\|_{2}^{2}
γ1,r(t)+C1ηnmyi=1σ(𝐰1,r(0),𝝁+γ1,r(t))𝝁22.\displaystyle\geq\gamma_{1,r}^{(t)}+\frac{C_{1}\eta}{nm}\cdot\sum_{y_{i}=1}\sigma^{\prime}(\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle+\gamma_{1,r}^{(t)})\cdot\|\bm{\mu}\|_{2}^{2}.

Denote γ^1,r(t)=γ1,r(t)+𝐰1,r(0),𝝁\widehat{\gamma}_{1,r}^{(t)}=\gamma_{1,r}^{(t)}+\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle and let A(t)=maxrγ^1,r(t)A^{(t)}=\max_{r}\widehat{\gamma}_{1,r}^{(t)}. Then we have

A(t+1)\displaystyle A^{(t+1)} A(t)+C1ηnmyi=1σ(A(t))𝝁22\displaystyle\geq A^{(t)}+\frac{C_{1}\eta}{nm}\cdot\sum_{y_{i}=1}\sigma^{\prime}(A^{(t)})\cdot\|\bm{\mu}\|_{2}^{2}
A(t)+C1ηq𝝁224m[A(t)]q1\displaystyle\geq A^{(t)}+\frac{C_{1}\eta q\|\bm{\mu}\|_{2}^{2}}{4m}\big{[}A^{(t)}\big{]}^{q-1}
(1+C1ηq𝝁224m[A(0)]q2)A(t)\displaystyle\geq\bigg{(}1+\frac{C_{1}\eta q\|\bm{\mu}\|_{2}^{2}}{4m}\big{[}A^{(0)}\big{]}^{q-2}\bigg{)}A^{(t)}
(1+C1ηqσ0q2𝝁2q2qm)A(t),\displaystyle\geq\bigg{(}1+\frac{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}{2^{q}m}\bigg{)}A^{(t)},

where the second inequality is by the lower bound on the number of positive data in Lemma B.1, the third inequality is due to the fact that A(t)A^{(t)} is an increasing sequence, and the last inequality follows by A(0)=maxr𝐰1,r(0),𝝁σ0𝝁2/2A^{(0)}=\max_{r}\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle\geq\sigma_{0}\|\bm{\mu}\|_{2}/2 proved in Lemma B.3. Therefore, the sequence A(t)A^{(t)} will exponentially grow and we have that

A(t)\displaystyle A^{(t)} (1+C1ηqσ0q2𝝁2q2qm)tA(0)exp(C1ηqσ0q2𝝁2q2q+1mt)A(0)exp(C1ηqσ0q2𝝁2q2q+1mt)σ0𝝁22,\displaystyle\geq\bigg{(}1+\frac{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}{2^{q}m}\bigg{)}^{t}A^{(0)}\geq\exp\bigg{(}\frac{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}{2^{q+1}m}t\bigg{)}A^{(0)}\geq\exp\bigg{(}\frac{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}{2^{q+1}m}t\bigg{)}\frac{\sigma_{0}\|\bm{\mu}\|_{2}}{2},

where the second inequality is due to the fact that 1+zexp(z/2)1+z\geq\exp(z/2) for z2z\leq 2 and our condition of η\eta and σ0\sigma_{0} listed in Condition 4.2, and the last inequality follows by Lemma B.3 and A(0)=maxr𝐰1,r(0),𝝁A^{(0)}=\max_{r}\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle. Therefore, A(t)A^{(t)} will reach 33 within T1=log(6/σ0𝝁2)2q+1mC1ηqσ0q2𝝁2qT_{1}=\frac{\log(6/\sigma_{0}\|\bm{\mu}\|_{2})2^{q+1}m}{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}} iterations. Since maxrγ1,r(t)A(t)maxr|𝐰1,r(0),𝝁|A(t)1\max_{r}\gamma_{1,r}^{(t)}\geq A^{(t)}-\max_{r}|\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle|\geq A^{(t)}-1, maxrγ1,r(t)\max_{r}\gamma_{1,r}^{(t)} will reach 22 within T1T_{1} iterations. We can next verify that

T1=log(6/σ0𝝁2)2q+1mC1ηqσ0q2𝝁2qnmη1σ02qσpqdq/22q+5q[4log(8mn/δ)](q1)/2=T1+/2,\displaystyle T_{1}=\frac{\log(6/\sigma_{0}\|\bm{\mu}\|_{2})2^{q+1}m}{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}\leq\frac{nm\eta^{-1}\sigma_{0}^{2-q}\sigma_{p}^{-q}d^{-q/2}}{2^{q+5}q[4\log(8mn/\delta)]^{(q-1)/2}}=T_{1}^{+}/2,

where the inequality holds due to our SNR condition in (D.1). Therefore, by the definition of T1,1T_{1,1}, we have T1,1T1T1+/2T_{1,1}\leq T_{1}\leq T_{1}^{+}/2, where we use the non-decreasing property of γ\gamma. The proof for j=1j=-1 is similar, and we can prove that maxrγ1,r(T1,1)2\max_{r}\gamma_{-1,r}^{(T_{1,-1})}\geq 2 while T1,1T1T1+/2T_{1,-1}\leq T_{1}\leq T_{1}^{+}/2, which completes the proof. ∎
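The mechanism driving this first stage is the scalar recursion A^{(t+1)}\geq A^{(t)}+\frac{C_{1}\eta q\|\bm{\mu}\|_{2}^{2}}{4m}\big{[}A^{(t)}\big{]}^{q-1} established above. The sketch below (the constants C_{1}, \eta, \sigma_{0}, \|\bm{\mu}\|_{2}, q, m are illustrative assumptions) iterates this recursion as an equality, i.e., the slowest growth allowed by the bound, starting from A^{(0)}=\sigma_{0}\|\bm{\mu}\|_{2}/2, and confirms that the constant threshold is crossed well within T_{1} iterations.

```python
import numpy as np

# Illustrative constants (C1, eta, sigma0, ||mu||_2, q, m are assumptions for this sketch).
C1, eta, sigma0, mu_norm, q, m = 0.5, 0.01, 1e-3, 2.0, 3, 10

A = sigma0 * mu_norm / 2                      # A^{(0)} >= sigma_0 ||mu||_2 / 2
T1 = (np.log(6 / (sigma0 * mu_norm)) * 2 ** (q + 1) * m
      / (C1 * eta * q * sigma0 ** (q - 2) * mu_norm ** q))

t = 0
while A < 3:
    A = A + C1 * eta * q * mu_norm ** 2 / (4 * m) * A ** (q - 1)   # one step of the recursion
    t += 1

print(t, int(T1))     # the threshold is reached within (well under) T1 iterations
assert t <= T1
```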

D.2 Second Stage

By the results obtained in the first stage, we know that

𝐰j,r(T1)\displaystyle\mathbf{w}_{j,r}^{(T_{1})} =𝐰j,r(0)+jγj,r(T1)𝝁𝝁22+i=1nρ¯j,r,i(T1)𝝃i𝝃i22+i=1nρ¯j,r,i(T1)𝝃i𝝃i22.\displaystyle=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(T_{1})}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}.

At the beginning of the second stage, the following properties hold:

  • maxrγj,r(T1)2,j{±1}\max_{r}\gamma_{j,r}^{(T_{1})}\geq 2,\forall j\in\{\pm 1\}.

  • maxj,r,i|ρj,r,i(T1)|β^\max_{j,r,i}|\rho_{j,r,i}^{(T_{1})}|\leq\widehat{\beta} where β^=σ0σpd/2\widehat{\beta}=\sigma_{0}\sigma_{p}\sqrt{d}/2.

Lemma 5.1 implies that the learned signal coefficient \gamma_{j,r}^{(t)} is non-decreasing, i.e., for t\geq T_{1}, we have that \gamma_{j,r}^{(t+1)}\geq\gamma_{j,r}^{(t)}, and therefore \max_{r}\gamma_{j,r}^{(t)}\geq 2. Now we choose \mathbf{W}^{*} as follows:

𝐰j,r=𝐰j,r(0)+2qmlog(2q/ϵ)j𝝁𝝁22.\displaystyle\mathbf{w}^{*}_{j,r}=\mathbf{w}_{j,r}^{(0)}+2qm\log(2q/\epsilon)\cdot j\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}.

Based on the above definition of 𝐖\mathbf{W}^{*}, we have the following lemma.

Lemma D.2.

Under the same conditions as Theorem 4.4, we have that 𝐖(T1)𝐖FO~(m3/2𝛍21)\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}\leq\widetilde{O}(m^{3/2}\|\bm{\mu}\|_{2}^{-1}).

Proof of Lemma D.2.

We have

𝐖(T1)𝐖F\displaystyle\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F} 𝐖(T1)𝐖(0)F+𝐖(0)𝐖F\displaystyle\leq\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{(0)}\|_{F}+\|\mathbf{W}^{(0)}-\mathbf{W}^{*}\|_{F}
j,rγj,r(T1)𝝁2+j,r,i|ρ¯j,r,i(T1)|𝝃i2+j,r,i|ρ¯j,r,i(T1)|𝝃i2+O(m3/2log(1/ϵ))𝝁21\displaystyle\leq\sum_{j,r}\frac{\gamma_{j,r}^{(T_{1})}}{\|\bm{\mu}\|_{2}}+\sum_{j,r,i}\frac{|\overline{\rho}_{j,r,i}^{(T_{1})}|}{\|\bm{\xi}_{i}\|_{2}}+\sum_{j,r,i}\frac{|\underline{\rho}_{j,r,i}^{(T_{1})}|}{\|\bm{\xi}_{i}\|_{2}}+O(m^{3/2}\log(1/\epsilon))\|\bm{\mu}\|_{2}^{-1}
O~(m𝝁1)+O(nmσ0)+O(m3/2log(1/ϵ))𝝁21\displaystyle\leq\widetilde{O}(m\|\bm{\mu}\|^{-1})+O(nm\sigma_{0})+O(m^{3/2}\log(1/\epsilon))\|\bm{\mu}\|_{2}^{-1}
O~(m3/2𝝁21),\displaystyle\leq\widetilde{O}(m^{3/2}\|\bm{\mu}\|_{2}^{-1}),

where the first inequality is by triangle inequality, the second inequality is by our decomposition of 𝐖(T1)\mathbf{W}^{(T_{1})} and the definition of 𝐖\mathbf{W}^{*}, the third inequality is by Proposition C.2 and Lemma D.1, and the last inequality is by our condition of σ0\sigma_{0} in Condition 4.2. ∎

Lemma D.3.

Under the same conditions as Theorem 4.3, we have that yif(𝐖(t),𝐱i),𝐖qlog(2q/ϵ)y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle\geq q\log(2q/\epsilon) for all i[n]i\in[n] and T1tTT_{1}\leq t\leq T^{*}.

Proof of Lemma D.3.

Recall that f(𝐖(t),𝐱i)=(1/m)j,rj[σ(𝐰j,r,yi𝝁)+σ(𝐰j,r,𝝃i)]f(\mathbf{W}^{(t)},\mathbf{x}_{i})=(1/m){\sum_{j,r}}j\cdot\big{[}\sigma(\langle\mathbf{w}_{j,r},y_{i}\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}_{i}\rangle)\big{]}, so we have

yif(𝐖(t),𝐱i),𝐖\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle =1mj,rσ(𝐰j,r(t),yi𝝁)𝝁,j𝐰j,r+1mj,rσ(𝐰j,r(t),𝝃i)yi𝝃i,j𝐰j,r\displaystyle=\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\langle\bm{\mu},j\mathbf{w}_{j,r}^{*}\rangle+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\langle y_{i}\bm{\xi}_{i},j\mathbf{w}_{j,r}^{*}\rangle
=1mj,rσ(𝐰j,r(t),yi𝝁)2qmlog(2q/ϵ)+1mj,rσ(𝐰j,r(t),yi𝝁)𝝁,j𝐰j,r(0)\displaystyle=\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)2qm\log(2q/\epsilon)+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\langle\bm{\mu},j\mathbf{w}_{j,r}^{(0)}\rangle
+1mj,rσ(𝐰j,r(t),𝝃i)yi𝝃i,j𝐰j,r(0)\displaystyle\qquad+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\langle y_{i}\bm{\xi}_{i},j\mathbf{w}_{j,r}^{(0)}\rangle
1mj,rσ(𝐰j,r(t),yi𝝁)2qmlog(2q/ϵ)1mj,rσ(𝐰j,r(t),yi𝝁)O~(σ0𝝁2)\displaystyle\geq\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)2qm\log(2q/\epsilon)-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})
1mj,rσ(𝐰j,r(t),𝝃i)O~(σ0σpd),\displaystyle\qquad-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}), (D.4)

where the inequality is by Lemma B.3. Next we will bound the inner-product terms in (D.4) respectively. By Lemma C.6, we have that for j=yij=y_{i}

maxr{𝐰j,r(t),yi𝝁}=maxr{γj,r(t)+𝐰j,r(0),yi𝝁}2O~(σ0𝝁2)1.\displaystyle\max_{r}\{\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle\}=\max_{r}\{\gamma_{j,r}^{(t)}+\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle\}\geq 2-\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})\geq 1. (D.5)

We can also get the upper bound of the inner products between the parameter and the signal (noise) as follows,

|𝐰j,r(t),𝝁|(i)|𝐰j,r(0),𝝁|+|γj,r(t)|(ii)O~(1)\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle|\overset{(i)}{\leq}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|+|\gamma_{j,r}^{(t)}|\overset{(ii)}{\leq}\widetilde{O}(1)
|𝐰j,r(t),𝝃i|(iii)|𝐰j,r(0),𝝃i|+|ρ¯j,r,i(t)|+|ρ¯j,r,i(t)|+8nlog(4n2/δ)dα(iv)O~(1),\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle|\overset{(iii)}{\leq}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|+|\underline{\rho}_{j,r,i}^{(t)}|+|\overline{\rho}_{j,r,i}^{(t)}|+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\overset{(iv)}{\leq}\widetilde{O}(1), (D.6)

where (i) is by Lemma C.3, (iii) is by Lemma C.4, (ii) and (iv) are due to Proposition C.2. Plugging (D.5) and (D.6) into (D.4) gives,

yif(𝐖(t),𝐱i),𝐖\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle 2qlog(2q/ϵ)O~(σ0𝝁2)O~(σ0σpd)qlog(2q/ϵ),\displaystyle\geq 2q\log(2q/\epsilon)-\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})-\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d})\geq q\log(2q/\epsilon),

where the last inequality is by σ0O~(m2/(q2)n1)min{(σpd)1,𝝁21}\sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-1})\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\} in Condition 4.2. This completes the proof. ∎

Lemma D.4.

Under the same conditions as Theorem 4.3, we have that

𝐖(t)𝐖F2𝐖(t+1)𝐖F2(2q1)ηLS(𝐖(t))ηϵ\displaystyle\|\mathbf{W}^{(t)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(t+1)}-\mathbf{W}^{*}\|_{F}^{2}\geq(2q-1)\eta L_{S}(\mathbf{W}^{(t)})-\eta\epsilon

for all T1tTT_{1}\leq t\leq T^{*}.

Proof of Lemma D.4.

We first apply a proof technique similar to Lemma 2.6 in Ji and Telgarsky (2020). The difference between our analysis and Ji and Telgarsky (2020) is that here the neural network is q-homogeneous rather than 1-homogeneous.

𝐖(t)𝐖F2𝐖(t+1)𝐖F2\displaystyle\|\mathbf{W}^{(t)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(t+1)}-\mathbf{W}^{*}\|_{F}^{2}
=2ηLS(𝐖(t)),𝐖(t)𝐖η2LS(𝐖(t))F2\displaystyle=2\eta\langle\nabla L_{S}(\mathbf{W}^{(t)}),\mathbf{W}^{(t)}-\mathbf{W}^{*}\rangle-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
=2ηni=1ni(t)[qyif(𝐖(t),𝐱i)f(𝐖(t),𝐱i),𝐖]η2LS(𝐖(t))F2\displaystyle=\frac{2\eta}{n}\sum_{i=1}^{n}\ell^{\prime(t)}_{i}[qy_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})-\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
2ηni=1ni(t)[qyif(𝐖(t),𝐱i)qlog(2q/ϵ)]η2LS(𝐖(t))F2\displaystyle\geq\frac{2\eta}{n}\sum_{i=1}^{n}\ell^{\prime(t)}_{i}[qy_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})-q\log(2q/\epsilon)]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
2qηni=1n[(yif(𝐖(t),𝐱i))ϵ/(2q)]η2LS(𝐖(t))F2\displaystyle\geq\frac{2q\eta}{n}\sum_{i=1}^{n}[\ell\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}-\epsilon/(2q)]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
(2q1)ηLS(𝐖(t))ηϵ,\displaystyle\geq(2q-1)\eta L_{S}(\mathbf{W}^{(t)})-\eta\epsilon,

where the first inequality is by Lemma D.3, the second inequality is due to the convexity of the cross entropy function, and the last inequality is due to Lemma C.7. ∎
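Two elementary facts drive the chain of (in)equalities above: Euler's identity for the q-homogeneous network, \langle\nabla f(\mathbf{W},\mathbf{x}),\mathbf{W}\rangle=qf(\mathbf{W},\mathbf{x}), which gives the second equality, and the convexity of the cross-entropy loss, \ell^{\prime}(z)(z-a)\geq\ell(z)-\ell(a) with a=\log(2q/\epsilon) and \ell(a)\leq\exp(-a)=\epsilon/(2q), which gives the second inequality. The sketch below (toy sizes and an illustrative \epsilon are assumptions) checks both numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

# (a) q-homogeneity: <grad_W f(W, x), W> = q f(W, x) for the CNN in (3.1),
#     checked on random weights and data (sizes are illustrative assumptions).
m, d, q = 4, 20, 3
mu, xi, y = rng.standard_normal(d), rng.standard_normal(d), 1
W = {j: 0.1 * rng.standard_normal((m, d)) for j in (+1, -1)}

act  = lambda z: np.maximum(z, 0) ** q
dact = lambda z: q * np.maximum(z, 0) ** (q - 1)
f = sum(j * (act(W[j] @ (y * mu)).sum() + act(W[j] @ xi).sum()) for j in (+1, -1)) / m
grad = {j: (j / m) * (np.outer(dact(W[j] @ (y * mu)), y * mu) + np.outer(dact(W[j] @ xi), xi))
        for j in (+1, -1)}
inner = sum((grad[j] * W[j]).sum() for j in (+1, -1))
assert np.isclose(inner, q * f)           # Euler's identity for the q-homogeneous network

# (b) convexity of the cross-entropy loss: ell'(z)(z - a) >= ell(z) - ell(a),
#     which with a = log(2q/eps) and ell(a) <= exp(-a) = eps/(2q) gives the second inequality.
ell  = lambda z: np.log1p(np.exp(-z))
dell = lambda z: -1.0 / (1.0 + np.exp(z))
eps = 0.01
a = np.log(2 * q / eps)
z = np.linspace(-30, 30, 10001)
assert np.all(dell(z) * (z - a) >= ell(z) - ell(a) - 1e-9)
assert ell(a) <= np.exp(-a)
```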

Lemma D.5 (Restatement of Lemma 5.6).

Under the same conditions as Theorem 4.3, let T=T1+𝐖(T1)𝐖F22ηϵ=T1+O~(m3η1ϵ1𝛍22)T=T_{1}+\Big{\lfloor}\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{2\eta\epsilon}\Big{\rfloor}=T_{1}+\widetilde{O}(m^{3}\eta^{-1}\epsilon^{-1}\|\bm{\mu}\|_{2}^{-2}). Then we have maxj,r,i|ρj,r,i(t)|2β^=σ0σpd\max_{j,r,i}|\rho_{j,r,i}^{(t)}|\leq 2\widehat{\beta}=\sigma_{0}\sigma_{p}\sqrt{d} for all T1tTT_{1}\leq t\leq T. Besides,

1tT1+1s=T1tLS(𝐖(s))𝐖(T1)𝐖F2(2q1)η(tT1+1)+ϵ2q1\displaystyle\frac{1}{t-T_{1}+1}\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(t-T_{1}+1)}+\frac{\epsilon}{2q-1}

for all T1tTT_{1}\leq t\leq T, and we can find an iterate with training loss smaller than ϵ\epsilon within TT iterations.

Proof of Lemma D.5.

By Lemma D.4, for any t[T1,T]t\in[T_{1},T], we have that

𝐖(s)𝐖F2𝐖(s+1)𝐖F2(2q1)ηLS(𝐖(s))ηϵ\displaystyle\|\mathbf{W}^{(s)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(s+1)}-\mathbf{W}^{*}\|_{F}^{2}\geq(2q-1)\eta L_{S}(\mathbf{W}^{(s)})-\eta\epsilon

holds for sts\leq t. Taking a summation, we obtain that

s=T1tLS(𝐖(s))\displaystyle\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)}) 𝐖(T1)𝐖F2+ηϵ(tT1+1)(2q1)η\displaystyle\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}+\eta\epsilon(t-T_{1}+1)}{(2q-1)\eta} (D.7)

for all T1tTT_{1}\leq t\leq T. Dividing (tT1+1)(t-T_{1}+1) on both side of (D.7) gives that

1tT1+1s=T1tLS(𝐖(s))𝐖(T1)𝐖F2(2q1)η(tT1+1)+ϵ2q1.\displaystyle\frac{1}{t-T_{1}+1}\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(t-T_{1}+1)}+\frac{\epsilon}{2q-1}.

Then we can take t=Tt=T and have that

1TT1+1s=T1TLS(𝐖(s))𝐖(T1)𝐖F2(2q1)η(TT1+1)+ϵ2q13ϵ2q1<ϵ,\displaystyle\frac{1}{T-T_{1}+1}\sum_{s=T_{1}}^{T}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(T-T_{1}+1)}+\frac{\epsilon}{2q-1}\leq\frac{3\epsilon}{2q-1}<\epsilon,

where we use the fact that q>2 and our choice that T=T_{1}+\Big{\lfloor}\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{2\eta\epsilon}\Big{\rfloor}. Since the average is smaller than \epsilon, we can conclude that there exists T_{1}\leq t\leq T such that L_{S}(\mathbf{W}^{(t)})<\epsilon.

Finally, we will prove that maxj,r,i|ρj,r,i(t)|2β^\max_{j,r,i}|\rho_{j,r,i}^{(t)}|\leq 2\widehat{\beta} for all t[T1,T]t\in[T_{1},T]. Plugging T=T1+𝐖(T1)𝐖F22ηϵT=T_{1}+\Big{\lfloor}\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{2\eta\epsilon}\Big{\rfloor} into (D.7) gives that

\displaystyle\sum_{s=T_{1}}^{T}L_{S}(\mathbf{W}^{(s)}) \displaystyle\leq\frac{2\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta}=\widetilde{O}(\eta^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2}), (D.8)

where the inequality is due to 𝐖(T1)𝐖FO~(m3/2𝝁21)\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}\leq\widetilde{O}(m^{3/2}\|\bm{\mu}\|_{2}^{-1}) in Lemma D.2. Define Ψ(t)=maxj,r,i|ρj,r,i(t)|\Psi^{(t)}=\max_{j,r,i}|\rho_{j,r,i}^{(t)}|. We will use induction to prove Ψ(t)2β^\Psi^{(t)}\leq 2\widehat{\beta} for all t[T1,T]t\in[T_{1},T]. At t=T1t=T_{1}, by the definition of β^\widehat{\beta}, clearly we have Ψ(T1)β^2β^\Psi^{(T_{1})}\leq\widehat{\beta}\leq 2\widehat{\beta}. Now suppose that there exists T~[T1,T]\widetilde{T}\in[T_{1},T] such that Ψ(t)2β^\Psi^{(t)}\leq 2\widehat{\beta} for all t[T1,T~1]t\in[T_{1},\widetilde{T}-1]. Then for t[T1,T~1]t\in[T_{1},\widetilde{T}-1], by (C.4) and (C.5) we have

Ψ(t+1)\displaystyle\Psi^{(t+1)} Ψ(t)+maxj,r,i{ηnm|i(t)|σ(𝐰j,r(0),𝝃i+2i=1nΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle\leq\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot|\ell_{i}^{\prime(t)}|\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+2\sum_{i^{\prime}=1}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}\bigg{\}}
=Ψ(t)+maxj,r,i{ηnm|i(t)|σ(𝐰j,r(0),𝝃i+2Ψ(t)+2iinΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle=\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot|\ell_{i}^{\prime(t)}|\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+2\Psi^{(t)}+2\sum_{i^{\prime}\neq i}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}\bigg{\}}
Ψ(t)+ηqnmmaxi|i(t)|[2log(8mn/δ)σ0σpd\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\max_{i}|\ell_{i}^{\prime(t)}|\cdot\Bigg{[}2\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}
+(2+4nσp2dlog(4n2/δ)σp2d/2)Ψ(t)]q12σp2d\displaystyle\qquad+\Bigg{(}2+\frac{4n\sigma_{p}^{2}\cdot\sqrt{d\log(4n^{2}/\delta)}}{\sigma_{p}^{2}d/2}\Bigg{)}\cdot\Psi^{(t)}\Bigg{]}^{q-1}\cdot 2\sigma_{p}^{2}d
Ψ(t)+ηqnmmaxi|i(t)|(2log(8mn/δ)σ0σpd+4Ψ(t))q12σp2d,\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\max_{i}|\ell_{i}^{\prime(t)}|\cdot\big{(}2\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}+4\cdot\Psi^{(t)}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d,

where the second inequality is due to Lemmas B.2 and B.3, and the last inequality follows by the assumption that d\geq 16n^{2}\log(4n^{2}/\delta). Taking a telescoping sum over t=T_{1},T_{1}+1,\ldots,\widetilde{T}-1, we have that

\displaystyle\Psi^{(\widetilde{T})} \displaystyle\overset{(i)}{\leq}\Psi^{(T_{1})}+\frac{\eta q}{nm}\sum_{s=T_{1}}^{\widetilde{T}-1}\max_{i}|\ell_{i}^{\prime(s)}|\widetilde{O}(\sigma_{p}^{2}d)\widehat{\beta}^{q-1}
(ii)Ψ(T1)+ηqnmO~(σp2d)β^q1s=T1T~1maxii(s)\displaystyle\overset{(ii)}{\leq}\Psi^{(T_{1})}+\frac{\eta q}{nm}\widetilde{O}(\sigma_{p}^{2}d)\widehat{\beta}^{q-1}\sum_{s=T_{1}}^{\widetilde{T}-1}\max_{i}\ell_{i}^{(s)}
(iii)Ψ(T1)+O~(ηm1σp2d)β^q1s=T1T~1LS(𝐖(s))\displaystyle\overset{(iii)}{\leq}\Psi^{(T_{1})}+\widetilde{O}(\eta m^{-1}\sigma_{p}^{2}d)\widehat{\beta}^{q-1}\sum_{s=T_{1}}^{\widetilde{T}-1}L_{S}(\mathbf{W}^{(s)})
(iv)Ψ(T1)+O~(m2SNR2)β^q1\displaystyle\overset{(iv)}{\leq}\Psi^{(T_{1})}+\widetilde{O}(m^{2}\mathrm{SNR}^{-2})\widehat{\beta}^{q-1}
(v)β^+O~(m2n2/qβ^q2)β^\displaystyle\overset{(v)}{\leq}\widehat{\beta}+\widetilde{O}(m^{2}n^{2/q}\widehat{\beta}^{q-2})\widehat{\beta}
(vi)2β^,\displaystyle\overset{(vi)}{\leq}2\widehat{\beta},

where (i) is by our induction hypothesis that \Psi^{(t)}\leq 2\widehat{\beta}, (ii) is by |\ell^{\prime}|\leq\ell, (iii) is by \max_{i}\ell_{i}^{(s)}\leq\sum_{i}\ell_{i}^{(s)}=nL_{S}(\mathbf{W}^{(s)}), (iv) is due to \sum_{s=T_{1}}^{\widetilde{T}-1}L_{S}(\mathbf{W}^{(s)})\leq\sum_{s=T_{1}}^{T}L_{S}(\mathbf{W}^{(s)})=\widetilde{O}(\eta^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2}) in (D.8), (v) is by n\mathrm{SNR}^{q}\geq\widetilde{\Omega}(1), and (vi) is by the definition that \widehat{\beta}=\sigma_{0}\sigma_{p}\sqrt{d}/2 and \widetilde{O}(m^{2}n^{2/q}\widehat{\beta}^{q-2})=\widetilde{O}(m^{2}n^{2/q}(\sigma_{0}\sigma_{p}\sqrt{d})^{q-2})\leq 1 by Condition 4.2. Therefore, \Psi^{(\widetilde{T})}\leq 2\widehat{\beta}, which completes the induction. ∎

D.3 Population Loss

Consider a new data point (𝐱,y)(\mathbf{x},y) drawn from the distribution defined in Definition 3.1. Without loss of generality, we suppose that the first patch is the signal patch and the second patch is the noise patch, i.e., 𝐱=[y𝝁,𝝃]\mathbf{x}=[y\bm{\mu},\bm{\xi}]. Moreover, by the signal-noise decomposition, the learned neural network has parameter

𝐰j,r(t)\displaystyle\mathbf{w}_{j,r}^{(t)} =𝐰j,r(0)+jγj,r(t)𝝁𝝁22+i=1nρ¯j,r,i(t)𝝃i𝝃i22+i=1nρ¯j,r,i(t)𝝃i𝝃i22\displaystyle=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(t)}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}

for j{±1}j\in\{\pm 1\} and r[m]r\in[m].

Lemma D.6.

Under the same conditions as Theorem 4.3, we have that maxj,r|𝐰j,r(t),𝛏i|1/2\max_{j,r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle|\leq 1/2 for all 0tT0\leq t\leq T.

Proof of Lemma D.6.

We can get the upper bound of the inner products between the parameter and the noise as follows:

|𝐰j,r(t),𝝃i|\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle| (i)|𝐰j,r(0),𝝃i|+|ρ¯j,r,i(t)|+|ρ¯j,r,i(t)|+8nlog(4n2/δ)dα\displaystyle\overset{(i)}{\leq}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|+|\underline{\rho}_{j,r,i}^{(t)}|+|\overline{\rho}_{j,r,i}^{(t)}|+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
(ii)2log(8mn/δ)σ0σpd+σ0σpd+8nlog(4n2/δ)dα\displaystyle\overset{(ii)}{\leq}2\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}+\sigma_{0}\sigma_{p}\sqrt{d}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
(iii)1/2\displaystyle\overset{(iii)}{\leq}1/2

for all j\in\{\pm 1\}, r\in[m] and i\in[n], where (i) is by Lemma C.4, (ii) is due to |\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\leq 2\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d} in Lemma B.3 and \max_{j,r,i}|\rho_{j,r,i}^{(t)}|\leq\sigma_{0}\sigma_{p}\sqrt{d} in Lemma D.5, and (iii) is due to our conditions \sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-1})\cdot(\sigma_{p}\sqrt{d})^{-1} and d\geq\widetilde{\Omega}(m^{2}n^{4}) in Condition 4.2. ∎

Lemma D.7.

Under the same conditions as Theorem 4.3, with probability at least 14mTexp(C21σ02σp2d1)1-4mT\cdot\exp(-C_{2}^{-1}\sigma_{0}^{-2}\sigma_{p}^{-2}d^{-1}), we have that maxj,r|𝐰j,r(t),𝛏|1/2\max_{j,r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle|\leq 1/2 for all 0tT0\leq t\leq T, where C2=O~(1)C_{2}=\widetilde{O}(1).

Proof of Lemma D.7.

Let 𝐰~j,r(t)=𝐰j,r(t)jγj,r(t)𝝁𝝁22\widetilde{\mathbf{w}}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(t)}-j\cdot\gamma_{j,r}^{(t)}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}, then we have that 𝐰~j,r(t),𝝃=𝐰j,r(t),𝝃\langle\widetilde{\mathbf{w}}_{j,r}^{(t)},\bm{\xi}\rangle=\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle and

𝐰~j,r(t)2O~(σ0d+nσ0)=O~(σ0d),\displaystyle\|\widetilde{\mathbf{w}}_{j,r}^{(t)}\|_{2}\leq\widetilde{O}(\sigma_{0}\sqrt{d}+n\sigma_{0})=\widetilde{O}(\sigma_{0}\sqrt{d}), (D.9)

where the equality is due to dΩ~(m2n4)d\geq\widetilde{\Omega}(m^{2}n^{4}) by Condition 4.2.

By (D.9), \max_{j,r}\|\widetilde{\mathbf{w}}_{j,r}^{(t)}\|_{2}\leq C_{1}\sigma_{0}\sqrt{d}, where C_{1}=\widetilde{O}(1). Clearly, \langle\widetilde{\mathbf{w}}_{j,r}^{(t)},\bm{\xi}\rangle follows a Gaussian distribution with mean zero and standard deviation at most C_{1}\sigma_{0}\sigma_{p}\sqrt{d}. Therefore, the tail probability is bounded by

(|𝐰~j,r(t),𝝃|1/2)2exp(18C12σ02σp2d).\displaystyle\mathbb{P}\big{(}|\langle\widetilde{\mathbf{w}}_{j,r}^{(t)},\bm{\xi}\rangle|\geq 1/2\big{)}\leq 2\exp\bigg{(}-\frac{1}{8C_{1}^{2}\sigma_{0}^{2}\sigma_{p}^{2}d}\bigg{)}.

Applying a union bound over j,r,tj,r,t completes the proof. ∎
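
The union bound above rests on the standard Gaussian tail inequality \mathbb{P}(|Z|\geq 1/2)\leq 2\exp(-1/(8s^{2})) for Z\sim\mathcal{N}(0,s^{2}). The sketch below (illustration only; the values of s, standing in for C_{1}\sigma_{0}\sigma_{p}\sqrt{d}, are assumptions) compares the exact tail with this bound.

```python
import math

# Sanity check (illustration only) of the Gaussian tail bound behind Lemma D.7:
# for Z ~ N(0, s^2), P(|Z| >= 1/2) = erfc(1/(2*s*sqrt(2))) <= 2*exp(-1/(8*s^2)).
# The values of s (standing in for C_1*sigma_0*sigma_p*sqrt(d)) are assumptions.
for s in (0.05, 0.1, 0.2):
    exact = math.erfc(0.5 / (s * math.sqrt(2)))   # exact two-sided tail at threshold 1/2
    bound = 2 * math.exp(-1.0 / (8 * s * s))
    print(f"s = {s}: tail = {exact:.3e} <= bound = {bound:.3e}")
    assert exact <= bound
```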

Lemma D.8 (Restatement of Lemma 5.7).

Let T be defined as in Lemma 5.5. Under the same conditions as Theorem 4.3, for any 0\leq t\leq T with L_{S}(\mathbf{W}^{(t)})\leq 1, it holds that L_{\mathcal{D}}(\mathbf{W}^{(t)})\leq 6\cdot L_{S}(\mathbf{W}^{(t)})+\exp(-n^{2}).

Proof of Lemma D.8.

Let \mathcal{E} be the event that the conclusion of Lemma D.7 holds. Then we can divide L_{\mathcal{D}}(\mathbf{W}^{(t)}) into two parts:

𝔼[(yf(𝐖(t),𝐱))]\displaystyle\mathbb{E}\big{[}\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)}\big{]} =𝔼[𝟙()(yf(𝐖(t),𝐱))]I1+𝔼[𝟙(c)(yf(𝐖(t),𝐱))]I2.\displaystyle=\underbrace{\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E})\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)}]}_{I_{1}}+\underbrace{\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E}^{c})\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)}]}_{I_{2}}. (D.10)

In the following, we bound I1I_{1} and I2I_{2} respectively.

Bounding I1I_{1}: Since LS(𝐖(t))1L_{S}(\mathbf{W}^{(t)})\leq 1, there must exist one (𝐱i,yi)(\mathbf{x}_{i},y_{i}) such that (yif(𝐖(t),𝐱i))LS(𝐖(t))1\ell\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\leq L_{S}(\mathbf{W}^{(t)})\leq 1, which implies that yif(𝐖(t),𝐱i)0y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\geq 0. Therefore, we have that

exp(yif(𝐖(t),𝐱i))(i)2log(1+exp(yif(𝐖(t),𝐱i)))=2(yif(𝐖(t),𝐱i))2LS(𝐖(t)),\displaystyle\exp(-y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i}))\overset{(i)}{\leq}2\log\big{(}1+\exp(-y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i}))\big{)}=2\ell\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\leq 2L_{S}(\mathbf{W}^{(t)}), (D.11)

where (i) is by z\leq 2\log(1+z) for all z\in[0,1]. If the event \mathcal{E} holds, we have that

|yf(𝐖(t),𝐱)yif(𝐖(t),𝐱i)|\displaystyle|yf(\mathbf{W}^{(t)},\mathbf{x})-y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})| 1mj,rσ(𝐰j,r,𝝃i)+1mj,rσ(𝐰j,r,𝝃)\displaystyle\leq\frac{1}{m}\sum_{j,r}\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}_{i}\rangle)+\frac{1}{m}\sum_{j,r}\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}\rangle)
1mj,rσ(1/2)+1mj,rσ(1/2)\displaystyle\leq\frac{1}{m}\sum_{j,r}\sigma(1/2)+\frac{1}{m}\sum_{j,r}\sigma(1/2)
1,\displaystyle\leq 1, (D.12)

where the second inequality is by maxj,r|𝐰j,r(t),𝝃|1/2\max_{j,r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle|\leq 1/2 in Lemma D.7 and maxj,r|𝐰j,r(t),𝝃i|1/2\max_{j,r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle|\leq 1/2 in Lemma D.6. Thus we have that

I1\displaystyle I_{1} 𝔼[𝟙()exp(yf(𝐖(t),𝐱))]\displaystyle\leq\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E})\exp(-yf(\mathbf{W}^{(t)},\mathbf{x}))]
e𝔼[𝟙()exp(yif(𝐖(t),𝐱i))]\displaystyle\leq e\cdot\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E})\exp(-y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i}))]
2e𝔼[𝟙()LS(𝐖(t))],\displaystyle\leq 2e\cdot\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E})L_{S}(\mathbf{W}^{(t)})],

where the first inequality is by the property of cross-entropy loss that (z)exp(z)\ell(z)\leq\exp(-z) for all zz, the second inequality is by (D.12), and the third inequality is by (D.11). Dropping the event in the expectation gives I16LS(𝐖(t))I_{1}\leq 6L_{S}(\mathbf{W}^{(t)}).
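
The bound on I_{1} uses two elementary facts about the cross-entropy loss; the sketch below (illustration only) verifies them on a grid.

```python
import math

# Grid check (illustration only) of the two elementary facts used above:
#   (a) ell(z) = log(1 + exp(-z)) <= exp(-z) for all z, and
#   (b) z <= 2 * log(1 + z) for z in [0, 1], which is inequality (i) in (D.11).
for k in range(-50, 51):
    z = k / 10.0
    assert math.log(1 + math.exp(-z)) <= math.exp(-z) + 1e-12   # (a)
for k in range(101):
    z = k / 100.0
    assert z <= 2 * math.log(1 + z) + 1e-12                     # (b)
print("both elementary inequalities hold on the tested grids")
```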

Bounding I2I_{2}: Next we bound the second term I2I_{2}. We choose an arbitrary training data (𝐱i,yi)(\mathbf{x}_{i^{\prime}},y_{i^{\prime}}) such that yi=yy_{i^{\prime}}=y. Then we have

(yf(𝐖(t),𝐱))\displaystyle\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)} log(1+exp(Fy(𝐖(t),𝐱)))\displaystyle\leq\log(1+\exp(F_{-y}(\mathbf{W}^{(t)},\mathbf{x})))
1+Fy(𝐖(t),𝐱)\displaystyle\leq 1+F_{-y}(\mathbf{W}^{(t)},\mathbf{x})
=1+1mj=y,r[m]σ(𝐰j,r(t),y𝝁)+1mj=y,r[m]σ(𝐰j,r(t),𝝃)\displaystyle=1+\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},y\bm{\mu}\rangle)+\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle)
\displaystyle\leq 1+F_{-y_{i^{\prime}}}(\mathbf{W}_{-y_{i^{\prime}}},\mathbf{x}_{i^{\prime}})+\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle)
2+1mj=y,r[m]σ(𝐰j,r(t),𝝃)\displaystyle\leq 2+\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle)
2+O~((σ0d)q)𝝃q,\displaystyle\leq 2+\widetilde{O}((\sigma_{0}\sqrt{d})^{q})\|\bm{\xi}\|^{q}, (D.13)

where the first inequality is due to Fy(𝐖(t),𝐱)0F_{y}(\mathbf{W}^{(t)},\mathbf{x})\geq 0, the second inequality is by the property of cross-entropy loss, i.e., log(1+exp(z))1+z\log(1+\exp(z))\leq 1+z for all z0z\geq 0, the third inequality is by 1mj=y,r[m]σ(𝐰j,r(t),y𝝁)Fy(𝐖y,𝐱i)=Fyi(𝐖yi,𝐱i)\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},y\bm{\mu}\rangle)\leq F_{-y}(\mathbf{W}_{-y},\mathbf{x}_{i^{\prime}})=F_{-y_{i^{\prime}}}(\mathbf{W}_{-y_{i^{\prime}}},\mathbf{x}_{i^{\prime}}), the fourth inequality is by Fyi(𝐖yi,𝐱i)1F_{-y_{i^{\prime}}}(\mathbf{W}_{-y_{i^{\prime}}},\mathbf{x}_{i^{\prime}})\leq 1 in Lemma C.5, and the last inequality is due to 𝐰~j,r(t),𝝃=𝐰j,r(t),𝝃𝐰~j,r(t)2𝝃2O~(σ0d)𝝃2\langle\widetilde{\mathbf{w}}_{j,r}^{(t)},\bm{\xi}\rangle=\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle\leq\|\widetilde{\mathbf{w}}_{j,r}^{(t)}\|_{2}\|\bm{\xi}\|_{2}\leq\widetilde{O}(\sigma_{0}\sqrt{d})\|\bm{\xi}\|_{2} in (D.9). Then we further have that

I2\displaystyle I_{2} 𝔼[𝟙(c)]𝔼[(yf(𝐖(t),𝐱))2]\displaystyle\leq\sqrt{\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E}^{c})]}\cdot\sqrt{\mathbb{E}\Big{[}\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)}^{2}\Big{]}}
(c)4+O~((σ0d)2q)𝔼[𝝃22q]\displaystyle\leq\sqrt{\mathbb{P}(\mathcal{E}^{c})}\cdot\sqrt{4+\widetilde{O}((\sigma_{0}\sqrt{d})^{2q})\mathbb{E}[\|\bm{\xi}\|_{2}^{2q}]}
exp[Ω~(σ02σp2d1)+polylog(d)]\displaystyle\leq\exp[-\widetilde{\Omega}(\sigma_{0}^{-2}\sigma_{p}^{-2}d^{-1})+\text{polylog}(d)]
exp(n2),\displaystyle\leq\exp(-n^{2}),

where the first inequality is by the Cauchy-Schwarz inequality, the second inequality is by (D.13), the third inequality is by Lemma D.7 and the fact that \sqrt{4+\widetilde{O}((\sigma_{0}\sqrt{d})^{2q})\mathbb{E}[\|\bm{\xi}\|_{2}^{2q}]}=O(\mathrm{poly}(d)), and the last inequality is by our condition \sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-1})\cdot(\sigma_{p}\sqrt{d})^{-1} in Condition 4.2. Plugging the bounds of I_{1}, I_{2} into (D.10) completes the proof. ∎
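
The bound on I_{2} starts from the Cauchy-Schwarz step \mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E}^{c})X]\leq\sqrt{\mathbb{P}(\mathcal{E}^{c})}\sqrt{\mathbb{E}[X^{2}]}. A small Monte Carlo illustration (with a synthetic heavy-tailed surrogate for the loss and an artificial rare event; not part of the proof) is given below.

```python
import numpy as np

# Monte Carlo illustration (not part of the proof) of the Cauchy-Schwarz step bounding I_2:
# for a nonnegative variable X and an event E^c, E[1(E^c) X] <= sqrt(P(E^c)) * sqrt(E[X^2]).
# The lognormal X and the 1% event below are synthetic stand-ins for the loss and E^c.
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)
rare = rng.random(200_000) < 0.01
lhs = np.mean(x * rare)
rhs = np.sqrt(np.mean(rare)) * np.sqrt(np.mean(x ** 2))
print(f"lhs = {lhs:.4f} <= rhs = {rhs:.4f}")
assert lhs <= rhs
```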

Appendix E Noise Memorization

In this section, we will consider the noise memorization case under the condition that σpq(d)qΩ~(n𝝁2q)\sigma_{p}^{q}(\sqrt{d})^{q}\geq\widetilde{\Omega}(n\|\bm{\mu}\|_{2}^{q}). We remind the readers that the proofs in this section are based on the results in Section B, which hold with high probability.

We also remind readers that α=4log(T)\alpha=4\log(T^{*}) is defined in Appendix C. Denote β¯=minimaxr𝐰yi,r(0),𝝃i\bar{\beta}=\min_{i}\max_{r}\langle\mathbf{w}_{y_{i},r}^{(0)},\bm{\xi}_{i}\rangle. The following lemma provides a lower bound of β¯\bar{\beta}.

Lemma E.1.

Under the same conditions as Theorem 4.4, if in particular

σ080nlog(4n2/δ)dαmin{(σpd)1,𝝁21},\displaystyle\sigma_{0}\geq 80n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\}, (E.1)

then we have that β¯σ0σpd/420nlog(4n2/δ)dα\bar{\beta}\geq\sigma_{0}\sigma_{p}\sqrt{d}/4\geq 20n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha.

Proof of Lemma E.1.

Because σpq(d)qΩ~(n𝝁2q)\sigma_{p}^{q}(\sqrt{d})^{q}\geq\widetilde{\Omega}(n\|\bm{\mu}\|_{2}^{q}), we have that σpd𝝁2\sigma_{p}\sqrt{d}\geq\|\bm{\mu}\|_{2}. Therefore we have that

β¯\displaystyle\bar{\beta} σ0σpd/4\displaystyle\geq\sigma_{0}\sigma_{p}\sqrt{d}/4
=σ0/4max{σpd,𝝁2}\displaystyle=\sigma_{0}/4\cdot\max\{\sigma_{p}\sqrt{d},\|\bm{\mu}\|_{2}\}
20nlog(4n2/δ)dα,\displaystyle\geq 20n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where the first inequality is by Lemma B.3 and the last inequality is by our lower bound condition of σ0\sigma_{0} in (E.1). ∎
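
A quick plug-in of made-up numbers (not from the paper) into this chain, assuming \sigma_{p}\sqrt{d}\geq\|\bm{\mu}\|_{2} and taking \sigma_{0} above the lower bound (E.1):

```python
import math

# Plugging made-up numbers (not from the paper) into the chain above: with
# sigma_p*sqrt(d) >= ||mu||_2 and sigma_0 above the lower bound (E.1), the quantity
# sigma_0*sigma_p*sqrt(d)/4 indeed dominates 20*n*sqrt(log(4n^2/delta)/d)*alpha.
n, d, delta, alpha = 20, 10**11, 0.01, 40.0
sigma_p, mu_norm = 1.0, 100.0
assert sigma_p * math.sqrt(d) >= mu_norm          # implied by the SNR condition of Appendix E

noise_scale = math.sqrt(math.log(4 * n**2 / delta) / d)
lower_bound = 80 * n * noise_scale * alpha * min(1 / (sigma_p * math.sqrt(d)), 1 / mu_norm)
sigma_0 = 2 * lower_bound                         # any sigma_0 satisfying (E.1)

lhs = sigma_0 * sigma_p * math.sqrt(d) / 4
rhs = 20 * n * noise_scale * alpha
print(f"{lhs:.3f} >= {rhs:.3f}")
assert lhs >= rhs
```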

E.1 First Stage

Lemma E.2.

Under the same conditions as Theorem 4.4, in particular if we choose

n1SNRqC2q+2log(20/(σ0σpd))(2log(8m/δ))q20.15q2,\displaystyle n^{-1}\mathrm{SNR}^{-q}\geq\frac{C2^{q+2}\log\big{(}20/(\sigma_{0}\sigma_{p}\sqrt{d})\big{)}(\sqrt{2\log(8m/\delta)})^{q-2}}{0.15^{q-2}}, (E.2)

where C=O(1)C=O(1) is a positive constant, then there exist

\displaystyle T_{1}=\frac{C\log\big(20/(\sigma_{0}\sigma_{p}\sqrt{d})\big)4mn}{0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}

such that

  • maxj,rρ¯j,r,i(T1)2\max_{j,r}\overline{\rho}_{j,r,i}^{(T_{1})}\geq 2 for all i[n]i\in[n].

  • maxj,rγj,r(t)=O~(σ0𝝁2)\max_{j,r}\gamma_{j,r}^{(t)}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}) for all 0tT10\leq t\leq T_{1}.

  • maxj,r,i|ρ¯j,r,i(t)|=O~(σ0σpd)\max_{j,r,i}|\underline{\rho}_{j,r,i}^{(t)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}) for all 0tT10\leq t\leq T_{1}.

Proof of Lemma E.2.

Let

T1+=mηq2q1(2log(8m/δ))q2σ0q2𝝁2q.\displaystyle T_{1}^{+}=\frac{m}{\eta q2^{q-1}(\sqrt{2\log(8m/\delta)})^{q-2}\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}. (E.3)

By Proposition C.2, we have that ρ¯j,r,i(t)β16nlog(4n2/δ)dαββ¯\underline{\rho}_{j,r,i}^{(t)}\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\geq-\beta-\bar{\beta} for all j{±1}j\in\{\pm 1\}, r[m]r\in[m], i[n]i\in[n] and 0tT0\leq t\leq T^{*}. Since ρ¯j,r,i(t)0\underline{\rho}_{j,r,i}^{(t)}\leq 0 and β¯β=O~(σ0σpd)\bar{\beta}\leq\beta=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}), we have that maxj,r,i|ρ¯j,r,i(t)|=O~(σ0σpd)\max_{j,r,i}|\underline{\rho}_{j,r,i}^{(t)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}). Next, we will carefully compute the growth of the γj,r(t)\gamma_{j,r}^{(t)}.

γj,r(t+1)\displaystyle\gamma_{j,r}^{(t+1)} =γj,r(t)ηnmi=1ni(t)σ(𝐰j,r(t),yi𝝁)𝝁22\displaystyle=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\|\bm{\mu}\|_{2}^{2}
γj,r(t)+ηnmi=1nσ(|𝐰j,r(0),𝝁|+γj,r(t))𝝁22,\displaystyle\leq\gamma_{j,r}^{(t)}+\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\sigma^{\prime}(|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|+\gamma_{j,r}^{(t)})\|\bm{\mu}\|_{2}^{2},

where the inequality is by ||1|\ell^{\prime}|\leq 1. Let A(t)=maxj,r{γj,r(t)+|𝐰j,r(0),𝝁|}A^{(t)}=\max_{j,r}\{\gamma_{j,r}^{(t)}+|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|\}, then we have that

A(t+1)A(t)+ηq𝝁22m[A(t)]q1.\displaystyle A^{(t+1)}\leq A^{(t)}+\frac{\eta q\|\bm{\mu}\|_{2}^{2}}{m}[A^{(t)}]^{q-1}. (E.4)

We will use induction to prove that A(t)2A(0)A^{(t)}\leq 2A^{(0)} for tT1+t\leq T_{1}^{+}. By definition, clearly we have that A(0)2A(0)A^{(0)}\leq 2A^{(0)}. Now suppose that there exists some T~T1+\widetilde{T}\leq T_{1}^{+} such that A(t)2A(0)A^{(t)}\leq 2A^{(0)} holds for 0tT~10\leq t\leq\widetilde{T}-1. Taking a telescoping sum of (E.4) gives that

\displaystyle A^{(\widetilde{T})}\leq A^{(0)}+\sum_{s=0}^{\widetilde{T}-1}\frac{\eta q\|\bm{\mu}\|_{2}^{2}}{m}[A^{(s)}]^{q-1}
A(0)+ηq𝝁22T1+2q1m[A(0)]q1\displaystyle\leq A^{(0)}+\frac{\eta q\|\bm{\mu}\|_{2}^{2}T_{1}^{+}2^{q-1}}{m}[A^{(0)}]^{q-1}
A(0)+ηq𝝁22T1+2q1m[2log(8m/δ)σ0𝝁2]q2A(0)\displaystyle\leq A^{(0)}+\frac{\eta q\|\bm{\mu}\|_{2}^{2}T_{1}^{+}2^{q-1}}{m}[\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2}]^{q-2}A^{(0)}
2A(0),\displaystyle\leq 2A^{(0)},

where the second inequality is by our induction hypothesis, the third inequality is by A02log(8m/δ)σ0𝝁2A_{0}\leq\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2} in Lemma B.3, and the last inequality is by (E.3). Thus we have that A(t)2A(0)A^{(t)}\leq 2A^{(0)} for all tT1+t\leq T_{1}^{+}. Therefore, maxj,rγj,r(t)A(t)+maxj,r{|𝐰j,r(0),𝝁|}3A(0)\max_{j,r}\gamma_{j,r}^{(t)}\leq A^{(t)}+\max_{j,r}\{|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|\}\leq 3A^{(0)} for all 0tT1+0\leq t\leq T_{1}^{+}. Recall that

ρ¯j,r,i(t+1)\displaystyle\overline{\rho}_{j,r,i}^{(t+1)} =ρ¯j,r,i(t)ηnmi(t)σ(𝐰j,r(t),𝝃i)𝟙(yi=j)𝝃i22.\displaystyle=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}^{(t)}_{j,r},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}.

For yi=jy_{i}=j, Lemma C.4 implies that

𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+ρ¯j,r,i(t)8nlog(4n2/δ)dα\displaystyle\geq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
ρ¯j,r,i(t)+𝐰j,r(0),𝝃i0.4β¯,\displaystyle\geq\overline{\rho}^{(t)}_{j,r,i}+\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle-0.4\bar{\beta},

where the last inequality is by β¯20nlog(4n2/δ)dα\bar{\beta}\geq 20n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha. Now let Bi(t)=maxj=yi,r{ρ¯j,r,i(t)+𝐰j,r(0),𝝃i0.4β¯}B_{i}^{(t)}=\max_{j=y_{i},r}\{\overline{\rho}_{j,r,i}^{(t)}+\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle-0.4\bar{\beta}\}. For each ii, denote by T1(i)T_{1}^{(i)} the last time in the period [0,T1+][0,T_{1}^{+}] satisfying that ρ¯j,r,i(t)2\overline{\rho}_{j,r,i}^{(t)}\leq 2. Then for tT1(i)t\leq T_{1}^{(i)}, maxj,r{|ρ¯j,r,i(t)|,|ρ¯j,r,i(t)|}=O(1)\max_{j,r}\{|\overline{\rho}_{j,r,i}^{(t)}|,|\underline{\rho}_{j,r,i}^{(t)}|\}=O(1) and maxj,rγj,r(t)3A(0)=O(1)\max_{j,r}\gamma_{j,r}^{(t)}\leq 3A^{(0)}=O(1). Therefore, by Lemmas C.5 and C.6, we know that F1(𝐖(t),𝐱i),F+1(𝐖(t),𝐱i)=O(1)F_{-1}(\mathbf{W}^{(t)},\mathbf{x}_{i}),F_{+1}(\mathbf{W}^{(t)},\mathbf{x}_{i})=O(1). Thus there exists a positive constant C1C_{1} such that i(t)C1-\ell^{\prime(t)}_{i}\geq C_{1} for all 0tT1(i)0\leq t\leq T_{1}^{(i)}. It is also easy to check that Bi(0)0.6β¯0.15σ0σpdB_{i}^{(0)}\geq 0.6\bar{\beta}\geq 0.15\sigma_{0}\sigma_{p}\sqrt{d}. Then we can carefully compute the growth of Bi(t)B_{i}^{(t)},

Bi(t+1)\displaystyle B_{i}^{(t+1)} Bi(t)+C1ηqσp2d2nm[Bi(t)]q1\displaystyle\geq B_{i}^{(t)}+\frac{C_{1}\eta q\sigma_{p}^{2}d}{2nm}[B_{i}^{(t)}]^{q-1}
Bi(t)+C1ηqσp2d2nm[Bi(0)]q2Bi(t)\displaystyle\geq B_{i}^{(t)}+\frac{C_{1}\eta q\sigma_{p}^{2}d}{2nm}[B_{i}^{(0)}]^{q-2}B_{i}^{(t)}
\displaystyle\geq\bigg(1+\frac{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}{2nm}\bigg)B_{i}^{(t)},

where the second inequality is by the non-decreasing property of Bi(t)B_{i}^{(t)}. Therefore, Bi(t)B_{i}^{(t)} is an exponentially increasing sequence and we have that

\displaystyle B_{i}^{(t)}\geq\bigg(1+\frac{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}{2nm}\bigg)^{t}B_{i}^{(0)}
\displaystyle\geq\exp\bigg(\frac{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}{4nm}t\bigg)B_{i}^{(0)}
\displaystyle\geq\exp\bigg(\frac{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}{4nm}t\bigg)\cdot 0.15\sigma_{0}\sigma_{p}\sqrt{d},

where the second inequality is due to the fact that 1+z\geq\exp(z/2) for z\leq 2 and our conditions on \eta and \sigma_{0} in Condition 4.2, and the last inequality is due to B_{i}^{(0)}\geq 0.15\sigma_{0}\sigma_{p}\sqrt{d}. Therefore, B_{i}^{(t)} will reach 3 within T_{1}=\frac{\log\big(20/(\sigma_{0}\sigma_{p}\sqrt{d})\big)4mn}{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}} iterations. Since \max_{j=y_{i},r}\overline{\rho}_{j,r,i}^{(t)}\geq B_{i}^{(t)}-\max_{j=y_{i},r}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|+0.4\bar{\beta}\geq B_{i}^{(t)}-1, \max_{j=y_{i},r}\overline{\rho}_{j,r,i}^{(t)} will reach 2 within T_{1} iterations. We can next verify that

\displaystyle T_{1}=\frac{\log\big(20/(\sigma_{0}\sigma_{p}\sqrt{d})\big)4mn}{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}\leq\frac{m}{\eta q2^{q}(\sqrt{2\log(8m/\delta)})^{q-2}\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}=T_{1}^{+}/2,

where the inequality holds due to our SNR condition in (E.2). Therefore, by the definition of T1(i)T_{1}^{(i)}, we have T1(i)T1T1+/2T_{1}^{(i)}\leq T_{1}\leq T_{1}^{+}/2, where we use the non-decreasing property of ρ¯j,r,i\overline{\rho}_{j,r,i}. This completes the proof. ∎
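
The proof above controls two scalar recursions: the signal coefficients grow slowly and stay below 2A^{(0)} up to T_{1}^{+}, while the noise coefficients grow geometrically and hit a constant threshold within T_{1} iterations. The simulation below (illustration only; the constants stand in for \eta q\|\bm{\mu}\|_{2}^{2}/m and C_{1}\eta q\sigma_{p}^{2}d/(2nm) and are made up) reproduces both behaviors for q=3.

```python
import math

# Illustrative simulation (made-up constants, not from the paper) of the two scalar
# recursions controlled in the proof above, with q = 3:
#   (1) signal growth A_{t+1} = A_t + cA * A_t^(q-1) stays below 2*A_0 for
#       t <= 1 / (cA * 2^(q-1) * A_0^(q-2)), mirroring the bound up to T_1^+;
#   (2) noise growth B_{t+1} = B_t + cB * B_t^(q-1) reaches the threshold within
#       2 * log(thr / B_0) / (cB * B_0^(q-2)) iterations, mirroring the bound on T_1.
q = 3
cA, A0 = 1e-4, 0.1             # cA stands in for eta*q*||mu||_2^2 / m
cB, B0, thr = 1e-2, 0.05, 3.0  # cB stands in for C_1*eta*q*sigma_p^2*d / (2nm)

horizon = int(1 / (cA * 2 ** (q - 1) * A0 ** (q - 2)))
A = A0
for _ in range(horizon):
    A += cA * A ** (q - 1)
assert A <= 2 * A0
print(f"A after {horizon} steps: {A:.4f} <= {2 * A0}")

hitting_bound = 2 * math.log(thr / B0) / (cB * B0 ** (q - 2))
B, t = B0, 0
while B < thr:
    B += cB * B ** (q - 1)
    t += 1
assert t <= hitting_bound
print(f"B reaches {thr} at t = {t} <= predicted bound {hitting_bound:.0f}")
```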

E.2 Second Stage

By the signal-noise decomposition, at the end of the first stage, we have

𝐰j,r(T1)\displaystyle\mathbf{w}_{j,r}^{(T_{1})} =𝐰j,r(0)+jγj,r(T1)𝝁𝝁22+i=1nρ¯j,r,i(T1)𝝃i𝝃i22+i=1nρ¯j,r,i(T1)𝝃i𝝃i22\displaystyle=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(T_{1})}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}

for j\in\{\pm 1\} and r\in[m]. By the results obtained in the first stage, we know that at the beginning of this stage the following properties hold:

  • maxrρ¯yi,r,i(T1)2\max_{r}\overline{\rho}_{y_{i},r,i}^{(T_{1})}\geq 2 for all i[n]i\in[n].

  • maxj,r,i|ρ¯j,r,i(T1)|=O~(σ0σpd)\max_{j,r,i}|\underline{\rho}_{j,r,i}^{(T_{1})}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}).

  • maxj,rγj,r(T1)β^\max_{j,r}\gamma_{j,r}^{(T_{1})}\leq\widehat{\beta}^{\prime}, where β^=O~(σ0𝝁2)\widehat{\beta}^{\prime}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}).

Note that Lemma 5.1 implies that the learned noise coefficient \overline{\rho}_{j,r,i}^{(t)} is non-decreasing, i.e., \overline{\rho}_{j,r,i}^{(t+1)}\geq\overline{\rho}_{j,r,i}^{(t)}. Therefore, for every data index i, we have \max_{r}\overline{\rho}_{y_{i},r,i}^{(t)}\geq 2 for all t\geq T_{1}. Now we choose \mathbf{W}^{*} as follows:

\displaystyle\mathbf{w}^{*}_{j,r}=\mathbf{w}_{j,r}^{(0)}+2qm\log(2q/\epsilon)\bigg[\sum_{i=1}^{n}\operatorname{\mathds{1}}(j=y_{i})\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}\bigg].

Based on the definition of 𝐖\mathbf{W}^{*}, we have the following lemma.

Lemma E.3.

Under the same conditions as Theorem 4.4, we have that \|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}\leq\widetilde{O}(m^{3/2}n^{1/2}\sigma_{p}^{-1}d^{-1/2}).

Proof of Lemma E.3.

We have

𝐖(T1)𝐖F\displaystyle\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F} 𝐖(T1)𝐖(0)F+𝐖(0)𝐖F\displaystyle\leq\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{(0)}\|_{F}+\|\mathbf{W}^{(0)}-\mathbf{W}^{*}\|_{F}
j,rγj,r(T1)𝝁21+O(m)maxj,ri=1nρ¯j,r,i(T1)𝝃i𝝃i22+i=1nρ¯j,r,i(T1)𝝃i𝝃i222\displaystyle\leq\sum_{j,r}\gamma_{j,r}^{(T_{1})}\|\bm{\mu}\|_{2}^{-1}+O(\sqrt{m})\max_{j,r}\bigg{\|}\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}\bigg{\|}_{2}
+O(m3/2n1/2log(1/ϵ)σp1d1/2)\displaystyle\qquad+O(m^{3/2}n^{1/2}\log(1/\epsilon)\sigma_{p}^{-1}d^{-1/2})
O~(m3/2n1/2σp1d1/2),\displaystyle\leq\widetilde{O}(m^{3/2}n^{1/2}\sigma_{p}^{-1}d^{-1/2}),

where the first inequality is by triangle inequality, the second inequality is by our decomposition of 𝐖(T1),𝐖\mathbf{W}^{(T_{1})},\mathbf{W}^{*} and Lemma B.2 (notice that different noises are almost orthogonal), and the last inequality is by Proposition C.2 and Lemma E.2. This completes the proof. ∎
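
The second inequality above relies on the near-orthogonality of the noise vectors from Lemma B.2. The Monte Carlo sketch below (illustration only; for simplicity it treats the noises as independent \mathcal{N}(0,\sigma_{p}^{2}\mathbf{I}_{d}) vectors, which may differ slightly from the exact structure of Definition 3.1) shows that |\langle\bm{\xi}_{i},\bm{\xi}_{i^{\prime}}\rangle|/\|\bm{\xi}_{i^{\prime}}\|_{2}^{2} is of order d^{-1/2}.

```python
import numpy as np

# Monte Carlo sketch (illustration only) of the near-orthogonality of noise vectors used
# above (Lemma B.2). For simplicity the noises are drawn as independent N(0, sigma_p^2 I_d)
# vectors, which ignores the exact structure of Definition 3.1; the ratio
# |<xi_i, xi_i'>| / ||xi_i'||_2^2 concentrates at the order d^(-1/2).
rng = np.random.default_rng(0)
d, sigma_p, trials = 4000, 1.0, 200
ratios = []
for _ in range(trials):
    xi_a = rng.normal(0.0, sigma_p, d)
    xi_b = rng.normal(0.0, sigma_p, d)
    ratios.append(abs(xi_a @ xi_b) / (xi_b @ xi_b))
print(f"mean ratio = {np.mean(ratios):.4f}, 1/sqrt(d) = {1 / np.sqrt(d):.4f}")
```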

Lemma E.4.

Under the same conditions as Theorem 4.4, we have that

yif(𝐖(t),𝐱i),𝐖qlog(2q/ϵ)\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle\geq q\log(2q/\epsilon)

for all T1tTT_{1}\leq t\leq T^{*}.

Proof of Lemma E.4.

Recall that f(𝐖(t),𝐱i)=(1/m)j,rj[σ(𝐰j,r,yi𝝁)+σ(𝐰j,r,𝝃i)]f(\mathbf{W}^{(t)},\mathbf{x}_{i})=(1/m){\sum_{j,r}}j\cdot\big{[}\sigma(\langle\mathbf{w}_{j,r},y_{i}\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}_{i}\rangle)\big{]}, so we have

yif(𝐖(t),𝐱i),𝐖\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle
=1mj,rσ(𝐰j,r(t),yi𝝁)𝝁,j𝐰j,r+1mj,rσ(𝐰j,r(t),𝝃i)yi𝝃i,j𝐰j,r\displaystyle=\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\langle\bm{\mu},j\mathbf{w}_{j,r}^{*}\rangle+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\langle y_{i}\bm{\xi}_{i},j\mathbf{w}_{j,r}^{*}\rangle
\displaystyle=\frac{1}{m}\sum_{j,r}\sum_{i^{\prime}=1}^{n}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)2qm\log(2q/\epsilon)\operatorname{\mathds{1}}(j=y_{i^{\prime}})\cdot\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}
+1mj,rσ(𝐰j,r(t),yi𝝁)𝝁,j𝐰j,r(0)+1mj,rσ(𝐰j,r(t),𝝃i)yi𝝃i,j𝐰j,r(0)\displaystyle\qquad+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\langle\bm{\mu},j\mathbf{w}_{j,r}^{(0)}\rangle+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\langle y_{i}\bm{\xi}_{i},j\mathbf{w}_{j,r}^{(0)}\rangle
\displaystyle\geq\frac{1}{m}\sum_{j=y_{i},r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)2qm\log(2q/\epsilon)-\frac{1}{m}\sum_{j,r}\sum_{i^{\prime}\not=i}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)2qm\log(2q/\epsilon)\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}
1mj,rσ(𝐰j,r(t),yi𝝁)O~(σ0𝝁2)1mj,rσ(𝐰j,r(t),𝝃i)O~(σ0σpd)\displaystyle\qquad-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d})
1mj=yi,rσ(𝐰j,r(t),𝝃i)2qmlog(2q/ϵ)1mj,rσ(𝐰j,r(t),𝝃i)O~(mnd1/2)\displaystyle\geq\frac{1}{m}\sum_{j=y_{i},r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)2qm\log(2q/\epsilon)-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\widetilde{O}(mnd^{-1/2})
1mj,rσ(𝐰j,r(t),yi𝝁)O~(σ0𝝁2)1mj,rσ(𝐰j,r(t),𝝃i)O~(σ0σpd),\displaystyle\qquad-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}), (E.5)

where the first inequality is by Lemma B.3 and the last inequality is by Lemma B.2. Next, we bound the inner-product terms in (E.5). By Lemma C.6, we have that

|𝐰j,r(t),yi𝝁||𝐰j,r(0),yi𝝁|+γj,r(t)O~(1),\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle|\leq|\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle|+\gamma_{j,r}^{(t)}\leq\widetilde{O}(1), (E.6)

where the last inequality is by Proposition C.2.

For j=yij=y_{i}, we can bound the inner product between the parameter and the noise as follows

maxj,r𝐰j,r(t),𝝃i\displaystyle\max_{j,r}\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle maxj,r[𝐰j,r(0),𝝃i+ρ¯j,r,i(t)]8nlog(4n2/δ)dα1,\displaystyle\geq\max_{j,r}\big{[}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}\big{]}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\geq 1, (E.7)

where the first inequality is by Lemma C.4, the second inequality is by Lemma E.2.

For j=yij=-y_{i}, we can bound the inner product between the parameter and the noise as follows

𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+8nlog(4n2/δ)dα1,\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 1, (E.8)

where the first inequality is by Lemma C.5 and the last inequality is by Lemma B.3 and the conditions of σ0\sigma_{0} and dd in Condition 4.2. Therefore, plugging (E.6), (E.7), (E.8) into (E.5) gives

\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle\geq 2q\log(2q/\epsilon)-\widetilde{O}(mnd^{-1/2})-\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})-\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d})
qlog(2q/ϵ),\displaystyle\geq q\log(2q/\epsilon),

where the last inequality is by dΩ~(m2n4)d\geq\widetilde{\Omega}(m^{2}n^{4}) and σ0O~(m2/(q2)n1)min{(σpd)1,𝝁21}\sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-1})\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\} in Condition 4.2. ∎
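
The last step is a simple comparison of magnitudes: the main term 2q\log(2q/\epsilon) dominates the three \widetilde{O}(\cdot) error terms, which are polynomially small under Condition 4.2. A toy numeric instance (with made-up stand-ins for the error terms) is sketched below.

```python
import math

# Toy arithmetic (made-up magnitudes, not from the paper) for the last step above: the
# main term 2q*log(2q/eps) dominates the three error terms, which are polynomially small
# under Condition 4.2, leaving at least q*log(2q/eps).
q, eps = 3, 1e-3
main = 2 * q * math.log(2 * q / eps)
errors = 1e-2 + 1e-3 + 1e-3   # stand-ins for O~(mn*d^-1/2), O~(sigma_0*||mu||_2), O~(sigma_0*sigma_p*sqrt(d))
print(f"{main:.2f} - {errors:.3f} >= {q * math.log(2 * q / eps):.2f}")
assert main - errors >= q * math.log(2 * q / eps)
```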

Lemma E.5.

Under the same conditions as Theorem 4.4, we have that

𝐖(t)𝐖F2𝐖(t+1)𝐖F2(2q1)ηLS(𝐖(t))ηϵ\displaystyle\|\mathbf{W}^{(t)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(t+1)}-\mathbf{W}^{*}\|_{F}^{2}\geq(2q-1)\eta L_{S}(\mathbf{W}^{(t)})-\eta\epsilon

for all T1tTT_{1}\leq t\leq T^{*}.

Proof of Lemma E.5.

The proof is essentially the same as the proof of Lemma D.4.

𝐖(t)𝐖F2𝐖(t+1)𝐖F2\displaystyle\|\mathbf{W}^{(t)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(t+1)}-\mathbf{W}^{*}\|_{F}^{2}
=2ηLS(𝐖(t)),𝐖(t)𝐖η2LS(𝐖(t))F2\displaystyle\qquad=2\eta\langle\nabla L_{S}(\mathbf{W}^{(t)}),\mathbf{W}^{(t)}-\mathbf{W}^{*}\rangle-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
\displaystyle\qquad=\frac{2\eta}{n}\sum_{i=1}^{n}\ell^{\prime(t)}_{i}[qy_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})-y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
\displaystyle\qquad\geq\frac{2\eta}{n}\sum_{i=1}^{n}\ell^{\prime(t)}_{i}[qy_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})-q\log(2q/\epsilon)]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
2qηni=1n[(yif(𝐖(t),𝐱i))ϵ/(2q)]η2LS(𝐖(t))F2\displaystyle\qquad\geq\frac{2q\eta}{n}\sum_{i=1}^{n}[\ell\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}-\epsilon/(2q)]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
(2q1)ηLS(𝐖(t))ηϵ,\displaystyle\qquad\geq(2q-1)\eta L_{S}(\mathbf{W}^{(t)})-\eta\epsilon,

where the first inequality is by Lemma E.4, the second inequality is due to the convexity of the cross-entropy loss, and the last inequality is due to Lemma C.7. ∎
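
The first equality in the display expands one gradient descent step \mathbf{W}^{(t+1)}=\mathbf{W}^{(t)}-\eta\nabla L_{S}(\mathbf{W}^{(t)}); the sketch below (illustration only) checks the underlying algebraic identity on random matrices.

```python
import numpy as np

# Sanity check (illustration only) of the algebraic identity behind the first equality
# above: for a gradient descent step W' = W - eta*G and any reference point W*,
#   ||W - W*||_F^2 - ||W' - W*||_F^2 = 2*eta*<G, W - W*> - eta^2*||G||_F^2.
rng = np.random.default_rng(0)
W, W_star, G = (rng.standard_normal((2, 5)) for _ in range(3))
eta = 0.1
W_next = W - eta * G
lhs = np.linalg.norm(W - W_star) ** 2 - np.linalg.norm(W_next - W_star) ** 2
rhs = 2 * eta * np.sum(G * (W - W_star)) - eta ** 2 * np.linalg.norm(G) ** 2
assert np.isclose(lhs, rhs)
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```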

Lemma E.6.

Under the same conditions as Theorem 4.4, let T=T1+𝐖(T1)𝐖F22ηϵ=T1+O~(η1ϵ1m3nd1σp2)T=T_{1}+\Big{\lfloor}\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{2\eta\epsilon}\Big{\rfloor}=T_{1}+\widetilde{O}(\eta^{-1}\epsilon^{-1}m^{3}nd^{-1}\sigma_{p}^{-2}). Then we have maxj,rγj,r(t)2β^\max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime}, maxj,r,i|ρ¯j,r,i(t)|=O~(σ0σpd)\max_{j,r,i}|\underline{\rho}_{j,r,i}^{(t)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}) for all T1tTT_{1}\leq t\leq T. Besides,

1tT1+1s=T1tLS(𝐖(s))𝐖(T1)𝐖F2(2q1)η(tT1+1)+ϵ(2q1)\displaystyle\frac{1}{t-T_{1}+1}\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(t-T_{1}+1)}+\frac{\epsilon}{(2q-1)}

for all T1tTT_{1}\leq t\leq T, and we can find an iterate with training loss smaller than ϵ\epsilon within TT iterations.

Proof of Lemma E.6.

By Lemma E.5, for any T1tTT_{1}\leq t\leq T, we obtain that

𝐖(s)𝐖F2𝐖(s+1)𝐖F2(2q1)ηLS(𝐖(s))ηϵ\displaystyle\|\mathbf{W}^{(s)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(s+1)}-\mathbf{W}^{*}\|_{F}^{2}\geq(2q-1)\eta L_{S}(\mathbf{W}^{(s)})-\eta\epsilon (E.9)

holds for T1stT_{1}\leq s\leq t. Taking a summation, we have that

s=T1tLS(𝐖(s))\displaystyle\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)}) 𝐖(T1)𝐖F2+ηϵ(tT1+1)(2q1)η\displaystyle\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}+\eta\epsilon(t-T_{1}+1)}{(2q-1)\eta}
(i)2𝐖(T1)𝐖F2(2q1)η\displaystyle\overset{(i)}{\leq}\frac{2\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta}
=(ii)O~(η1m3nd1σp2),\displaystyle\overset{(ii)}{=}\widetilde{O}(\eta^{-1}m^{3}nd^{-1}\sigma_{p}^{-2}), (E.10)

where (i) is by t\leq T and the definition of T, and (ii) is by Lemma E.3. Then we can use induction to prove that \max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime} for all t\in[T_{1},T]. Clearly, by the definition of \widehat{\beta}^{\prime}, we have \max_{j,r}\gamma_{j,r}^{(T_{1})}\leq\widehat{\beta}^{\prime}\leq 2\widehat{\beta}^{\prime}. Now suppose that there exists \widetilde{T}\in[T_{1},T] such that \max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime} for all t\in[T_{1},\widetilde{T}-1]. Then by (C.3), we have

γj,r(T~)\displaystyle\gamma_{j,r}^{(\widetilde{T})} =γj,r(T1)ηnms=T1T~1i=1ni(t)σ(𝐰j,r(t),yi𝝁)𝝁22,\displaystyle=\gamma_{j,r}^{(T_{1})}-\frac{\eta}{nm}\sum_{s=T_{1}}^{\widetilde{T}-1}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\|\bm{\mu}\|_{2}^{2},
(i)γj,r(T1)+q3q1ηnm𝝁22β^q1s=T1T~1i=1n|i(t)|\displaystyle\overset{(i)}{\leq}\gamma_{j,r}^{(T_{1})}+\frac{q3^{q-1}\eta}{nm}\|\bm{\mu}\|_{2}^{2}\widehat{\beta}^{\prime q-1}\sum_{s=T_{1}}^{\widetilde{T}-1}\sum_{i=1}^{n}|\ell_{i}^{\prime(t)}|
(ii)γj,r(T1)+q3q1ηm1𝝁22β^q1s=T1T~1LS(𝐖(s))\displaystyle\overset{(ii)}{\leq}\gamma_{j,r}^{(T_{1})}+q3^{q-1}\eta m^{-1}\|\bm{\mu}\|_{2}^{2}\widehat{\beta}^{\prime q-1}\sum_{s=T_{1}}^{\widetilde{T}-1}L_{S}(\mathbf{W}^{(s)})
(iii)γj,r(T1)+β^q1O~(m2nSNR2)\displaystyle\overset{(iii)}{\leq}\gamma_{j,r}^{(T_{1})}+\widehat{\beta}^{\prime q-1}\widetilde{O}(m^{2}n\mathrm{SNR}^{2})
(iv)γj,r(T1)+β^q1O~(m2n12/q)\displaystyle\overset{(iv)}{\leq}\gamma_{j,r}^{(T_{1})}+\widehat{\beta}^{\prime q-1}\widetilde{O}(m^{2}n^{1-2/q})
(v)2β^\displaystyle\overset{(v)}{\leq}2\widehat{\beta}^{\prime}

for all j\in\{\pm 1\} and r\in[m], where (i) is by the induction hypothesis \max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime}, (ii) is by |\ell^{\prime}|\leq\ell, (iii) is by (E.10), (iv) is by n^{-1}\mathrm{SNR}^{-q}\geq\widetilde{\Omega}(1), and (v) is by \widehat{\beta}^{\prime}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}) and \widehat{\beta}^{\prime q-2}\widetilde{O}(m^{2}n^{1-2/q})=\widetilde{O}(m^{2}n^{1-2/q}(\sigma_{0}\|\bm{\mu}\|_{2})^{q-2})\leq 1 by Condition 4.2. Therefore, we have \max_{j,r}\gamma_{j,r}^{(\widetilde{T})}\leq 2\widehat{\beta}^{\prime}, which completes the induction. Moreover, the bound \max_{j,r,i}|\underline{\rho}_{j,r,i}^{(t)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}) follows from Proposition C.2 exactly as in the proof of Lemma E.2. Finally, taking t=T in the average bound displayed in the lemma statement and using T-T_{1}+1\geq\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}/(2\eta\epsilon) shows that the average training loss over [T_{1},T] is at most 3\epsilon/(2q-1)\leq\epsilon, so there exists an iterate with training loss smaller than \epsilon. ∎
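
The following sketch (illustration only, with made-up values of \eta, \epsilon and \|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}) illustrates the arithmetic behind the choice of T: the average-loss bound over the window [T_{1},T] falls below \epsilon.

```python
# Arithmetic illustration (made-up values of eta, eps and ||W^(T1) - W*||_F^2) of the
# choice of T above: with t - T1 + 1 >= ||W^(T1) - W*||_F^2 / (2*eta*eps), the average
# training loss over [T1, T] is at most 3*eps/(2q-1) <= eps, so some iterate in the
# window has training loss below eps.
q, eta, eps, D = 3, 0.01, 1e-3, 5.0         # D stands in for ||W^(T1) - W*||_F^2
window = int(D / (2 * eta * eps)) + 1       # t - T1 + 1 at t = T
avg_bound = D / ((2 * q - 1) * eta * window) + eps / (2 * q - 1)
print(f"average-loss bound {avg_bound:.2e} < eps = {eps:.0e}")
assert avg_bound < eps
```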

E.3 Population Loss

Lemma E.7 (4th statement of Theorem 4.4).

Under the same conditions as Theorem 4.4, within O~(η1nσ02qσpqdq/2+η1ϵ1m3nσp2d1)\widetilde{O}(\eta^{-1}n\sigma_{0}^{2-q}\sigma_{p}^{-q}d^{-q/2}+\eta^{-1}\epsilon^{-1}m^{3}n\sigma_{p}^{-2}d^{-1}) iterations, we can find 𝐖(T)\mathbf{W}^{(T)} such that LS(𝐖(T))ϵL_{S}(\mathbf{W}^{(T)})\leq\epsilon. Besides, for any 0tT0\leq t\leq T we have that L𝒟(𝐖(t))0.1L_{\mathcal{D}}(\mathbf{W}^{(t)})\geq 0.1.

Proof of Lemma E.7.

Given a new example (x,y)(x,y), we have that

𝐰j,r(t)2\displaystyle\|\mathbf{w}_{j,r}^{(t)}\|_{2} =𝐰j,r(0)+jγj,r(t)𝝁𝝁22+i=1nρ¯j,r,i(t)𝝃i𝝃i22+i=1nρ¯j,r,i(t)𝝃i𝝃i222\displaystyle=\bigg{\|}\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(t)}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}\bigg{\|}_{2}
(i)𝐰j,r(0)2+γj,r(t)𝝁2+i=1nρ¯j,r,i(t)𝝃i2+i=1n|ρ¯j,r,i(t)|𝝃i2\displaystyle\overset{(i)}{\leq}\|\mathbf{w}_{j,r}^{(0)}\|_{2}+\frac{\gamma_{j,r}^{(t)}}{\|\bm{\mu}\|_{2}}+\sum_{i=1}^{n}\frac{\overline{\rho}_{j,r,i}^{(t)}}{\|\bm{\xi}_{i}\|_{2}}+\sum_{i=1}^{n}\frac{|\underline{\rho}_{j,r,i}^{(t)}|}{\|\bm{\xi}_{i}\|_{2}}
=(ii)O(σ0d)+O~(nσp1d1/2),\displaystyle\overset{(ii)}{=}O(\sigma_{0}\sqrt{d})+\widetilde{O}(n\sigma_{p}^{-1}d^{-1/2}),

where (i) is by triangle inequality and (ii) is by maxj,rγj,r(t)=O~(σ0𝝁2)\max_{j,r}\gamma_{j,r}^{(t)}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}) in Lemma E.6 and maxi,j,r|ρj,r,i|4log(T)\max_{i,j,r}|\rho_{j,r,i}|\leq 4\log(T^{*}) in Proposition 5.3.

Therefore, we have that 𝐰j,r(t),𝝃𝒩(0,σp2𝐰j,r(t)22)\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle\sim\mathcal{N}(0,\sigma_{p}^{2}\|\mathbf{w}_{j,r}^{(t)}\|_{2}^{2}). So with probability 11/(4m)1-1/(4m),

|𝐰j,r(t),𝝃|O~(σ0σpd+nd1/2).\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle|\leq\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}+nd^{-1/2}).

Since the signal vector \bm{\mu} is orthogonal to the noises, by \max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}) in Lemma E.6, we also have that |\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle|\leq|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|+\gamma_{j,r}^{(t)}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}). Now by a union bound, with probability at least 1/2, we have that

Fj(𝐖j(t),𝐱)\displaystyle F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}) =1mr=1mσ(𝐰j,r(t),y𝝁)+1mr=1mσ(𝐰j,r(t),𝝃)\displaystyle=\frac{1}{m}\sum_{r=1}^{m}\sigma(\langle\mathbf{w}_{j,r}^{(t)},y\bm{\mu}\rangle)+\frac{1}{m}\sum_{r=1}^{m}\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle)
maxr|𝐰j,r(t),𝝁|q+maxr|𝐰j,r(t),𝝃|q\displaystyle\leq\max_{r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle|^{q}+\max_{r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle|^{q}
\displaystyle\leq\widetilde{O}(\sigma_{0}^{q}\sigma_{p}^{q}d^{q/2}+n^{q}d^{-q/2}+\sigma_{0}^{q}\|\bm{\mu}\|_{2}^{q})
1,\displaystyle\leq 1,

where the last inequality is by \sigma_{0}\leq\widetilde{O}(n^{-1}m^{-2/(q-2)}\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\}) and d\geq\widetilde{\Omega}(m^{2}n^{4}) in Condition 4.2. Therefore, with probability at least 1/2, we have that

(yf(𝐖(t),𝐱))log(1+e1).\displaystyle\ell\big{(}y\cdot f(\mathbf{W}^{(t)},\mathbf{x})\big{)}\geq\log(1+e^{-1}).

Thus L𝒟(𝐖(t))log(1+e1)0.50.1L_{\mathcal{D}}(\mathbf{W}^{(t)})\geq\log(1+e^{-1})\cdot 0.5\geq 0.1. This completes the proof. ∎
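
A one-line numeric check of the final constant (illustration only):

```python
import math

# One-line numeric check (illustration only) of the final constant: when both F_{+1} and
# F_{-1} lie in [0, 1], the loss is at least log(1 + e^{-1}) ~ 0.313, and half of that
# (for the probability-1/2 event) still exceeds 0.1.
floor = math.log(1 + math.exp(-1))
print(f"log(1 + e^-1) = {floor:.4f}, half = {0.5 * floor:.4f}")
assert 0.5 * floor >= 0.1
```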
