
Benign Overfitting in Two-layer Convolutional
Neural Networks

Yuan Cao*, Zixiang Chen*, Mikhail Belkin, and Quanquan Gu

*Equal contribution

Department of Statistics and Actuarial Science and Department of Mathematics, The University of Hong Kong, Hong Kong; e-mail: [email protected]
Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
Halicioğlu Data Science Institute, University of California, San Diego, La Jolla, CA 92093, USA; e-mail: [email protected]
Department of Computer Science, University of California, Los Angeles, CA 90095, USA; e-mail: [email protected]
Abstract

Modern neural networks often have great expressive power and can be trained to overfit the training data, while still achieving good test performance. This phenomenon is referred to as “benign overfitting”. Recently, a line of works has emerged that studies benign overfitting from the theoretical perspective. However, these works are limited to linear models or kernel/random feature models, and there is still a lack of theoretical understanding of when and how benign overfitting occurs in neural networks. In this paper, we study the benign overfitting phenomenon in training a two-layer convolutional neural network (CNN). We show that when the signal-to-noise ratio satisfies a certain condition, a two-layer CNN trained by gradient descent can achieve arbitrarily small training and test loss. On the other hand, when this condition does not hold, overfitting becomes harmful and the obtained CNN can only achieve constant level test loss. These results together demonstrate a sharp phase transition between benign overfitting and harmful overfitting, driven by the signal-to-noise ratio. To the best of our knowledge, this is the first work that precisely characterizes the conditions under which benign overfitting can occur in training convolutional neural networks.

1 Introduction

Modern deep learning models often consist of a huge number of model parameters, more than the number of training data points, and are therefore over-parameterized. These over-parameterized models can be trained to overfit the training data (achieving close to $100\%$ training accuracy), while still making accurate predictions on unseen test data. This phenomenon has been observed in a number of prior works (Zhang et al., 2017; Neyshabur et al., 2019), and is often referred to as benign overfitting (Bartlett et al., 2020). It revolutionizes the classical understanding of the bias-variance trade-off in statistical learning theory, and has drawn great attention from the community (Belkin et al., 2018, 2019a, 2019b; Hastie et al., 2019).

There exist a number of works towards understanding the benign overfitting phenomenon. While they offer important insights into benign overfitting, most of them are limited to the settings of linear models (Belkin et al., 2019b; Bartlett et al., 2020; Hastie et al., 2019; Wu and Xu, 2020; Chatterji and Long, 2020; Zou et al., 2021b; Cao et al., 2021) and kernel/random feature models (Belkin et al., 2018; Liang and Rakhlin, 2020; Montanari and Zhong, 2020), and cannot be applied to neural network models, which are of greater interest. The only notable exceptions are Adlam and Pennington (2020); Li et al. (2021), which attempted to understand benign overfitting in neural network models. However, they are still limited to the “neural tangent kernel regime” (Jacot et al., 2018), where the neural network learning problem is essentially equivalent to kernel regression. Thus, it remains a largely open problem to show how and when benign overfitting can occur in neural networks.

Clearly, understanding benign overfitting in neural networks is much more challenging than in linear models, kernel methods, or random feature models. The foremost challenge stems from nonconvexity: previous works on linear models and kernel methods/random features are all in the convex setting, while neural network training is a highly nonconvex optimization problem. Therefore, while most of the previous works can study the minimum norm interpolators/maximum margin classifiers according to the implicit bias results (Soudry et al., 2018) for the corresponding models, existing implicit bias results for neural networks (e.g., Lyu and Li (2019)) are not sufficient, and a new analysis of the neural network learning process is needed.

In this work, we provide one such algorithmic analysis for learning two-layer convolutional neural networks (CNNs) with the second layer parameters being fixed as $+1$’s and $-1$’s and polynomial ReLU activation function $\sigma(z)=\max\{0,z\}^{q}$, where $q>2$ is a hyperparameter. We consider a setting where the input data consist of label dependent signals and label independent noises, and utilize a signal-noise decomposition of the CNN filters to precisely characterize the signal learning and noise memorization processes during neural network training. Our result not only demonstrates that benign overfitting can occur in learning two-layer neural networks, but also gives precise conditions under which the overfitted CNN trained by gradient descent can achieve small population loss. Our paper makes the following major contributions:

  • We establish population loss bounds of overfitted CNN models trained by gradient descent, and theoretically demonstrate that benign overfitting can occur in learning over-parameterized neural networks. We show that under certain conditions on the signal-to-noise ratio, CNN models trained by gradient descent will prioritize learning the signal over memorizing the noise, and thus achieving both small training and test losses. To the best of our knowledge, this is the first result on the benign overfitting of neural networks that is beyond the neural tangent kernel regime.

  • We also establish a negative result showing that when the conditions on the signal-to-noise ratio do not hold, then the overfitted CNN model will achieve at least a constant population loss. This result, together with our upper bound result, reveals an interesting phase transition between benign overfitting and harmful overfitting.

  • Our analysis is based on a new proof technique namely signal-noise decomposition, which decomposes the convolutional filters into a linear combination of initial filters, the signal vectors and the noise vectors. We convert the neural network learning into a discrete dynamical system of the coefficients from the decomposition, and perform a two-stage analysis that decouples the complicated relation among the coefficients. This enables us to analyze the non-convex optimization problem, and bound the population loss of the CNN trained by gradient descent. We believe our proof technique is of independent interest and can potentially be applied to deep neural networks.

We note that a concurrent work (Frei et al., 2022) studies learning log-concave mixture data with label flip noise using fully-connected two-layer neural networks with smoothed leaky ReLU activation. Notably, their risk bound matches the risk bound for linear models given in Cao et al. (2021) when the label flip noise is zero. However, their analysis only focuses on upper bounding the risk, and cannot demonstrate the phase transition between benign and harmful overfitting. Compared with Frei et al. (2022), we focus on CNNs, and consider a different data model to better capture the nature of image classification problems. Moreover, we present both positive and negative results under different SNR regimes, and demonstrate a sharp phase transition between benign and harmful overfitting.

Notation.

Given two sequences $\{x_{n}\}$ and $\{y_{n}\}$, we denote $x_{n}=O(y_{n})$ if there exist some absolute constant $C_{1}>0$ and $N>0$ such that $|x_{n}|\leq C_{1}|y_{n}|$ for all $n\geq N$. Similarly, we denote $x_{n}=\Omega(y_{n})$ if there exist $C_{2}>0$ and $N>0$ such that $|x_{n}|\geq C_{2}|y_{n}|$ for all $n>N$. We say $x_{n}=\Theta(y_{n})$ if $x_{n}=O(y_{n})$ and $x_{n}=\Omega(y_{n})$ both hold. We use $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$, and $\widetilde{\Theta}(\cdot)$ to hide logarithmic factors in these notations respectively. Moreover, we denote $x_{n}=\mathrm{poly}(y_{n})$ if $x_{n}=O(y_{n}^{D})$ for some positive constant $D$, and $x_{n}=\operatorname{polylog}(y_{n})$ if $x_{n}=\mathrm{poly}(\log(y_{n}))$. Finally, for two scalars $a$ and $b$, we denote $a\vee b=\max\{a,b\}$.

2 Related Work

A line of recent works has attempted to understand why overfitted predictors can still achieve good test performance. Belkin et al. (2019a) first empirically demonstrated that in many machine learning models, such as random Fourier features, decision trees and ensemble methods, the population risk curve has a double descent shape with respect to the number of model parameters. Belkin et al. (2019b) further studied two specific data models, namely the Gaussian model and Fourier series model, and theoretically demonstrated the double descent risk curve in linear regression. Bartlett et al. (2020) studied over-parameterized linear regression to fit data produced by a linear model with additive noises, and established matching upper and lower bounds of the risk achieved by the minimum norm interpolator on the training dataset. It is shown that under certain conditions on the spectrum of the data covariance matrix, the population risk of the interpolator can be asymptotically optimal. Hastie et al. (2019); Wu and Xu (2020) studied linear regression in the setting where both the dimension and sample size grow together with a fixed ratio, and showed double descent of the risk with respect to this ratio. Chatterji and Long (2020) studied the population risk bounds of over-parameterized linear logistic regression on sub-Gaussian mixture models with label flipping noises, and showed how gradient descent can train over-parameterized linear models to achieve nearly optimal population risk. Cao et al. (2021) tightened the upper bound given by Chatterji and Long (2020) in the case without label flipping noises, and established a matching lower bound of the risk achieved by over-parameterized maximum margin interpolators. Shamir (2022) proposed a generic data model for benign overfitting of linear predictors, and studied different problem settings under which benign overfitting can or cannot occur.

Besides the studies on linear models, several recent works also studied the benign overfitting and double descent phenomena in kernel methods or random feature models. Zhang et al. (2017) first pointed out that overfitting kernel predictors can sometimes still achieve good population risk. Liang and Rakhlin (2020) studied how interpolating kernel regression with radial basis function (RBF) kernels (and variants) can generalize and how the spectrum of the data covariance matrix affects the population risk of the interpolating kernel predictor. Li et al. (2021) studied the benign overfitting phenomenon of random feature models defined as two-layer neural networks whose first layer parameters are fixed at random initialization. Mei and Montanari (2019); Liao et al. (2020) demonstrated the double descent phenomenon for the population risk of interpolating random feature predictors with respect to the ratio between the dimensions of the random features and the data input. Adlam and Pennington (2020) showed that neural tangent kernel (Jacot et al., 2018) based kernel regression has a triple descent risk curve with respect to the total number of trainable parameters. Montanari and Zhong (2020) further pointed out an interesting phase transition of the generalization error achieved by neural networks trained in the neural tangent kernel regime.

3 Problem Setup

In this section, we introduce the data generation model and the convolutional neural network we consider in this paper. We focus on binary classification, and present our data distribution $\mathcal{D}$ in the following definition.

Definition 3.1.

Let $\bm{\mu}\in\mathbb{R}^{d}$ be a fixed vector representing the signal contained in each data point. Then each data point $(\mathbf{x},y)$ with $\mathbf{x}=[\mathbf{x}_{1}^{\top},\mathbf{x}_{2}^{\top}]^{\top}\in\mathbb{R}^{2d}$ and $y\in\{-1,1\}$ is generated from the following distribution $\mathcal{D}$:

  1. The label $y$ is generated as a Rademacher random variable.

  2. A noise vector $\bm{\xi}$ is generated from the Gaussian distribution $N(\mathbf{0},\sigma_{p}^{2}\cdot(\mathbf{I}-\bm{\mu}\bm{\mu}^{\top}\cdot\|\bm{\mu}\|_{2}^{-2}))$.

  3. One of $\mathbf{x}_{1},\mathbf{x}_{2}$ is given as $y\cdot\bm{\mu}$, which represents the signal, while the other is given by $\bm{\xi}$, which represents the noise.

Our data generation model is inspired by image data, where the inputs consist of different patches, and only some of the patches are related to the class label of the image. In detail, the patch assigned as $y\cdot\bm{\mu}$ is the signal patch that is correlated to the label of the data, and the patch assigned as $\bm{\xi}$ is the noise patch that is independent of the label of the data and therefore is irrelevant for prediction. We assume that the noise patch is generated from the Gaussian distribution $N(\mathbf{0},\sigma_{p}^{2}\cdot(\mathbf{I}-\bm{\mu}\bm{\mu}^{\top}\cdot\|\bm{\mu}\|_{2}^{-2}))$ to ensure, for simplicity, that the noise vector is orthogonal to the signal vector $\bm{\mu}$. Note that when the dimension $d$ is large, $\|\bm{\xi}\|_{2}\approx\sigma_{p}\sqrt{d}$ by standard concentration bounds. Therefore, we can treat $\|\bm{\mu}\|_{2}/(\sigma_{p}\sqrt{d})\approx\|\bm{\mu}\|_{2}/\|\bm{\xi}\|_{2}$ as the signal-to-noise ratio. For ease of discussion, we denote $\mathrm{SNR}=\|\bm{\mu}\|_{2}/(\sigma_{p}\sqrt{d})$. Note that the Bayes risk for learning our model is zero. We could also add label flip noise similar to Chatterji and Long (2020); Frei et al. (2022) to make the Bayes risk equal to the label flip noise level and therefore nonzero, but this would not change the key message of our paper.
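To make the data model concrete, the following minimal NumPy sketch samples from the distribution in Definition 3.1. The function and variable names (sample_data, mu, sigma_p) are our own illustrative choices rather than notation from the paper, and the assignment of the signal to one of the two patches is made uniformly at random here.

    import numpy as np

    def sample_data(n, mu, sigma_p, rng):
        """Sample n points (x, y) from the distribution D of Definition 3.1.

        Each x consists of two d-dimensional patches: one equals y * mu (the signal),
        the other is Gaussian noise with covariance sigma_p^2 * (I - mu mu^T / ||mu||_2^2),
        i.e. noise orthogonal to mu.
        """
        d = mu.shape[0]
        y = rng.choice([-1, 1], size=n)                   # Rademacher labels
        xi = sigma_p * rng.standard_normal((n, d))        # isotropic Gaussian noise
        mu_unit = mu / np.linalg.norm(mu)
        xi -= np.outer(xi @ mu_unit, mu_unit)             # project out the mu direction
        signal = y[:, None] * mu[None, :]
        swap = rng.random(n) < 0.5                        # which patch carries the signal
        x1 = np.where(swap[:, None], xi, signal)
        x2 = np.where(swap[:, None], signal, xi)
        return np.concatenate([x1, x2], axis=1), y        # X has shape (n, 2d)

    rng = np.random.default_rng(0)
    d, n = 1000, 20
    mu = np.zeros(d); mu[0] = 1.0                         # ||mu||_2 = 1
    X, y = sample_data(n, mu, sigma_p=1.0 / np.sqrt(d), rng=rng)
    print(X.shape, y[:5])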

Intuitively, if a classifier learns the signal $\bm{\mu}$ and utilizes the signal patch of the data to make predictions, it can perfectly fit a given training data set $\{(\mathbf{x}_{i},y_{i}):i\in[n]\}$ and at the same time have good performance on the test data. However, when the dimension $d$ is large ($d>n$), a classifier that is a function of the noises $\bm{\xi}_{i}$, $i\in[n]$ can also perfectly fit the training data set, while its predictions will be totally random on new test data. Therefore, the data generation model given in Definition 3.1 is a useful model to study the population loss of overfitted classifiers. Similar models have been studied in some recent works by Li et al. (2019); Allen-Zhu and Li (2020a, b); Zou et al. (2021a).

Two-layer CNNs. We consider a two-layer convolutional neural network whose filters are applied to the two patches $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ separately, and the second layer parameters of the network are fixed as $+1/m$ and $-1/m$ respectively. Then the network can be written as $f(\mathbf{W},\mathbf{x})=F_{+1}(\mathbf{W}_{+1},\mathbf{x})-F_{-1}(\mathbf{W}_{-1},\mathbf{x})$, where $F_{+1}(\mathbf{W}_{+1},\mathbf{x})$, $F_{-1}(\mathbf{W}_{-1},\mathbf{x})$ are defined as

F_{j}(\mathbf{W}_{j},\mathbf{x})=\frac{1}{m}\sum_{r=1}^{m}\big[\sigma(\langle\mathbf{w}_{j,r},\mathbf{x}_{1}\rangle)+\sigma(\langle\mathbf{w}_{j,r},\mathbf{x}_{2}\rangle)\big]=\frac{1}{m}\sum_{r=1}^{m}\big[\sigma(\langle\mathbf{w}_{j,r},y\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}\rangle)\big]

for $j\in\{+1,-1\}$, where $m$ is the number of convolutional filters in $F_{+1}$ and $F_{-1}$. Here, $\sigma(z)=(\max\{0,z\})^{q}$ is the $\mathrm{ReLU}^{q}$ activation function with $q>2$, $\mathbf{w}_{j,r}\in\mathbb{R}^{d}$ denotes the weight for the $r$-th filter (i.e., neuron), and $\mathbf{W}_{j}$ is the collection of model weights associated with $F_{j}$. We also use $\mathbf{W}$ to denote the collection of all model weights. We note that our CNN model can also be viewed as a CNN with average global pooling (Lin et al., 2013). We train the above CNN model by minimizing the empirical cross-entropy loss function

L_{S}(\mathbf{W})=\frac{1}{n}\sum_{i=1}^{n}\ell[y_{i}\cdot f(\mathbf{W},\mathbf{x}_{i})],

where $\ell(z)=\log(1+\exp(-z))$, and $S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ is the training data set. We further define the true loss (test loss) $L_{\mathcal{D}}(\mathbf{W}):=\mathbb{E}_{(\mathbf{x},y)\sim\mathcal{D}}\ell[y\cdot f(\mathbf{W},\mathbf{x})]$.
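As a concrete illustration, the following NumPy sketch evaluates the two-layer CNN $f(\mathbf{W},\mathbf{x})$ and the empirical cross-entropy loss $L_{S}(\mathbf{W})$ defined above. The array conventions (filters stored in an array of shape (2, m, d), index 0 for $j=+1$ and index 1 for $j=-1$) are our own, not notation from the paper.

    import numpy as np

    def relu_q(z, q=3):
        """Polynomial ReLU activation sigma(z) = max(0, z)^q."""
        return np.maximum(z, 0.0) ** q

    def cnn_forward(W, X, q=3):
        """Two-layer CNN f(W, x) = F_{+1}(W_{+1}, x) - F_{-1}(W_{-1}, x).

        W: array of shape (2, m, d); W[0] plays the role of W_{+1}, W[1] of W_{-1}.
        X: array of shape (n, 2d); each row is the concatenation [x_1, x_2].
        """
        m, d = W.shape[1], W.shape[2]
        x1, x2 = X[:, :d], X[:, d:]
        # F_j(W_j, x) = (1/m) sum_r [sigma(<w_{j,r}, x_1>) + sigma(<w_{j,r}, x_2>)]
        F = [(relu_q(x1 @ W[j].T, q) + relu_q(x2 @ W[j].T, q)).sum(axis=1) / m
             for j in range(2)]
        return F[0] - F[1]

    def train_loss(W, X, y, q=3):
        """Empirical cross-entropy loss L_S(W) = (1/n) sum_i log(1 + exp(-y_i f(W, x_i)))."""
        return np.mean(np.log1p(np.exp(-y * cnn_forward(W, X, q))))

With sample_data from the previous sketch, train_loss can be evaluated at a Gaussian initialization W = sigma_0 * rng.standard_normal((2, m, d)), matching the initialization described in the next paragraph.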

We consider gradient descent starting from Gaussian initialization, where each entry of $\mathbf{W}_{+1}$ and $\mathbf{W}_{-1}$ is sampled from a Gaussian distribution $N(0,\sigma_{0}^{2})$, and $\sigma_{0}^{2}$ is the variance. The gradient descent update of the filters in the CNN can be written as

\mathbf{w}_{j,r}^{(t+1)}=\mathbf{w}_{j,r}^{(t)}-\eta\cdot\nabla_{\mathbf{w}_{j,r}}L_{S}(\mathbf{W}^{(t)})=\mathbf{w}_{j,r}^{(t)}-\frac{\eta}{nm}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot jy_{i}\bm{\xi}_{i}-\frac{\eta}{nm}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\cdot j\bm{\mu} \qquad (3.1)

for $j\in\{\pm 1\}$ and $r\in[m]$, where we introduce the shorthand notation $\ell_{i}^{\prime(t)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]$.
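For completeness, here is a sketch of a single gradient descent step implementing the closed-form update (3.1) without automatic differentiation. It reuses cnn_forward from the previous sketch and the same array conventions; the helper names are ours, and the code only mirrors the update rule, it does not reproduce the paper's experiments.

    import numpy as np

    def drelu_q(z, q=3):
        """Derivative of sigma(z) = max(0, z)^q, i.e. sigma'(z) = q * max(0, z)^(q-1)."""
        return q * np.maximum(z, 0.0) ** (q - 1)

    def gradient_step(W, X, y, eta, q=3):
        """One gradient descent step on L_S(W), mirroring update (3.1).

        W has shape (2, m, d) with W[0] = W_{+1}, W[1] = W_{-1};
        X has shape (n, 2d) with rows [x_1, x_2].
        """
        n = X.shape[0]
        m, d = W.shape[1], W.shape[2]
        x1, x2 = X[:, :d], X[:, d:]
        margins = y * cnn_forward(W, X, q)              # y_i * f(W, x_i)
        ell_prime = -1.0 / (1.0 + np.exp(margins))      # l'(z) = -1 / (1 + e^z) < 0
        W_new = W.copy()
        for idx, j in enumerate([+1, -1]):
            # grad_{w_{j,r}} L_S = (1/(nm)) sum_i l'_i * y_i * j *
            #   [sigma'(<w_{j,r}, x_{i,1}>) x_{i,1} + sigma'(<w_{j,r}, x_{i,2}>) x_{i,2}],
            # which coincides with the two sums in (3.1) under the data model.
            coef = (ell_prime * y)[:, None] * j / (n * m)           # shape (n, 1)
            grad = (coef * drelu_q(x1 @ W[idx].T, q)).T @ x1 \
                 + (coef * drelu_q(x2 @ W[idx].T, q)).T @ x2        # shape (m, d)
            W_new[idx] = W[idx] - eta * grad
        return W_new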

4 Main Results

In this section, we present our main theoretical results. At the core of our analyses and results is a signal-noise decomposition of the filters in the CNN trained by gradient descent. By the gradient descent update rule (3.1), it is clear that the gradient descent iterate $\mathbf{w}_{j,r}^{(t)}$ is a linear combination of its random initialization $\mathbf{w}_{j,r}^{(0)}$, the signal vector $\bm{\mu}$ and the noise vectors in the training data $\bm{\xi}_{i}$, $i\in[n]$. Motivated by this observation, we introduce the following definition.

Definition 4.1.

Let $\mathbf{w}_{j,r}^{(t)}$ for $j\in\{\pm 1\}$, $r\in[m]$ be the convolution filters of the CNN at the $t$-th iteration of gradient descent. Then there exist unique coefficients $\gamma_{j,r}^{(t)}\geq 0$ and $\rho_{j,r,i}^{(t)}$ such that

\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(t)}\cdot\|\bm{\mu}\|_{2}^{-2}\cdot\bm{\mu}+\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\cdot\|\bm{\xi}_{i}\|_{2}^{-2}\cdot\bm{\xi}_{i}.

We further denote $\overline{\rho}_{j,r,i}^{(t)}:=\rho_{j,r,i}^{(t)}\operatorname{\mathds{1}}(\rho_{j,r,i}^{(t)}\geq 0)$ and $\underline{\rho}_{j,r,i}^{(t)}:=\rho_{j,r,i}^{(t)}\operatorname{\mathds{1}}(\rho_{j,r,i}^{(t)}\leq 0)$. Then we have that

\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(t)}\cdot\|\bm{\mu}\|_{2}^{-2}\cdot\bm{\mu}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(t)}\cdot\|\bm{\xi}_{i}\|_{2}^{-2}\cdot\bm{\xi}_{i}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(t)}\cdot\|\bm{\xi}_{i}\|_{2}^{-2}\cdot\bm{\xi}_{i}. \qquad (4.1)

We refer to (4.1) as the signal-noise decomposition of $\mathbf{w}_{j,r}^{(t)}$. We add the normalization factors $\|\bm{\mu}\|_{2}^{-2},\|\bm{\xi}_{i}\|_{2}^{-2}$ in the definition so that $\gamma_{j,r}^{(t)}\approx\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle$ and $\rho_{j,r,i}^{(t)}\approx\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle$. In this decomposition, $\gamma_{j,r}^{(t)}$ characterizes the progress of learning the signal vector $\bm{\mu}$, and $\rho_{j,r,i}^{(t)}$ characterizes the degree of noise memorization by the filter. Evidently, based on this decomposition, for some iteration $t$: (i) if some of the $\gamma_{j,r}^{(t)}$’s are large enough while the $|\rho_{j,r,i}^{(t)}|$’s are relatively small, then the CNN will have small training and test losses; (ii) if some $\overline{\rho}_{j,r,i}^{(t)}$’s are large and all $\gamma_{j,r}^{(t)}$’s are small, then the CNN will achieve a small training loss, but a large test loss. Thus, Definition 4.1 provides a handle for us to study the convergence of the training loss as well as the population loss of the CNN trained by gradient descent.
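Since $\bm{\mu}$ and $\bm{\xi}_{1},\ldots,\bm{\xi}_{n}$ are linearly independent with probability one, the coefficients in Definition 4.1 can be recovered from a trained filter by solving a small least-squares problem. The sketch below is our own diagnostic for doing so numerically (the sign $j$ is absorbed into the returned gamma); it is not part of the paper's proofs.

    import numpy as np

    def signal_noise_coefficients(w_t, w_0, mu, Xi):
        """Recover gamma and rho_i such that
        w_t = w_0 + gamma * mu / ||mu||^2 + sum_i rho_i * xi_i / ||xi_i||^2
        (the sign j of Definition 4.1 is absorbed into the returned gamma).

        w_t, w_0: filter at iteration t and at initialization, shape (d,).
        mu: signal vector, shape (d,).  Xi: noise vectors, shape (n, d).
        """
        # Basis whose columns are mu/||mu||^2 and xi_i/||xi_i||^2.
        basis = np.column_stack([mu / np.dot(mu, mu),
                                 (Xi / (Xi ** 2).sum(axis=1, keepdims=True)).T])
        coef, *_ = np.linalg.lstsq(basis, w_t - w_0, rcond=None)
        return coef[0], coef[1:]

Tracking gamma and max |rho_i| along a training run gives a direct empirical view of the signal learning versus noise memorization dynamics analyzed below.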

Our results are based on the following conditions on the dimension $d$, sample size $n$, neural network width $m$, learning rate $\eta$, and initialization scale $\sigma_{0}$.

Condition 4.2.

Suppose that

  1. Dimension $d$ is sufficiently large: $d=\widetilde{\Omega}(m^{2\vee[4/(q-2)]}n^{4\vee[(2q-2)/(q-2)]})$.

  2. Training sample size $n$ and neural network width $m$ satisfy $n,m=\Omega(\operatorname{polylog}(d))$.

  3. The learning rate $\eta$ satisfies $\eta\leq\widetilde{O}(\min\{\|\bm{\mu}\|_{2}^{-2},\sigma_{p}^{-2}d^{-1}\})$.

  4. The standard deviation of the Gaussian initialization $\sigma_{0}$ is appropriately chosen such that $\widetilde{O}(nd^{-1/2})\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\}\leq\sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-[1/(q-2)]\vee 1})\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\}$.

A few remarks on Condition 4.2 are in order. The condition on $d$ is to ensure that the learning is in a sufficiently over-parameterized setting, and similar conditions have been made in the study of learning over-parameterized linear models (Chatterji and Long, 2020; Cao et al., 2021). For example, if we choose $q=3$, then the condition on $d$ becomes $d=\widetilde{\Omega}(m^{4}n^{4})$. Furthermore, we require the sample size and neural network width to be at least polylogarithmic in the dimension $d$ to ensure that certain statistical properties of the training data and weight initialization hold with probability at least $1-d^{-1}$, which is a mild condition. Finally, the conditions on $\sigma_{0}$ and $\eta$ are to ensure that gradient descent can effectively minimize the training loss, and they depend on the scale of the training data points. When $\sigma_{p}=O(d^{-1/2})$ and $\|\bm{\mu}\|_{2}=O(1)$, the step size $\eta$ can be chosen as large as $\widetilde{O}(1)$ and the initialization $\sigma_{0}$ can be as large as $\widetilde{O}(m^{-2/(q-2)}n^{-[1/(q-2)]\vee 1})$. In our paper, we only require $m,n=\Omega(\operatorname{polylog}(d))$, so the initialization scale and step size can be chosen to be of nearly constant order. Based on these conditions, we give our main result on signal learning in the following theorem.
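As a quick illustration of how the requirement on $d$ scales with $q$, the following throwaway check (our own variable names, ignoring the hidden logarithmic factors) reproduces the $q=3$ example above.

    def dim_condition_exponents(q):
        """Exponents (a, b) such that the condition on d reads d = Omega~(m^a * n^b)."""
        a = max(2, 4 / (q - 2))
        b = max(4, (2 * q - 2) / (q - 2))
        return a, b

    for q in [3, 4, 6]:
        print(q, dim_condition_exponents(q))   # q = 3 gives (4.0, 4.0), i.e. d = Omega~(m^4 n^4)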

Theorem 4.3.

For any $\epsilon>0$, let $T=\widetilde{\Theta}(\eta^{-1}m\sigma_{0}^{-(q-2)}\|\bm{\mu}\|_{2}^{-q}+\eta^{-1}\epsilon^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2})$. Under Condition 4.2, if $n\cdot\mathrm{SNR}^{q}=\widetilde{\Omega}(1)$ (here the $\widetilde{\Omega}(\cdot)$ hides a $\operatorname{polylog}(\epsilon^{-1})$ factor; this applies to Theorem 4.4 as well), then with probability at least $1-d^{-1}$, there exists $0\leq t\leq T$ such that:

  1. The CNN learns the signal: $\max_{r}\gamma_{j,r}^{(t)}=\Omega(1)$ for $j\in\{\pm 1\}$.

  2. The CNN does not memorize the noises in the training data: $\max_{j,r,i}|\rho_{j,r,i}^{(T)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d})$.

  3. The training loss converges to $\epsilon$, i.e., $L_{S}(\mathbf{W}^{(t)})\leq\epsilon$.

  4. The trained CNN achieves a small test loss: $L_{\mathcal{D}}(\mathbf{W}^{(t)})\leq 6\epsilon+\exp(-n^{2})$.

Theorem 4.3 characterizes the case of signal learning. It shows that, if $n\cdot\mathrm{SNR}^{q}=\widetilde{\Omega}(1)$, then at least one CNN filter can learn the signal by achieving $\gamma_{j,r_{j}^{*}}^{(t)}\geq\Omega(1)$, and as a result, the learned neural network can achieve small training and test losses. To demonstrate the sharpness of this condition, we also present the following theorem for the noise memorization by the CNN.

Theorem 4.4.

For any $\epsilon>0$, let $T=\widetilde{\Theta}(\eta^{-1}m\cdot n(\sigma_{p}\sqrt{d})^{-q}\cdot\sigma_{0}^{-(q-2)}+\eta^{-1}\epsilon^{-1}nm^{3}d^{-1}\sigma_{p}^{-2})$. Under Condition 4.2, if $n^{-1}\cdot\mathrm{SNR}^{-q}=\widetilde{\Omega}(1)$, then with probability at least $1-d^{-1}$, there exists $0\leq t\leq T$ such that:

  1. The CNN memorizes noises in the training data: $\max_{r}\overline{\rho}_{y_{i},r,i}^{(t)}=\Omega(1)$.

  2. The CNN does not sufficiently learn the signal: $\max_{j,r}\gamma_{j,r}^{(t)}\leq\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})$.

  3. The training loss converges to $\epsilon$, i.e., $L_{S}(\mathbf{W}^{(t)})\leq\epsilon$.

  4. The trained CNN has a constant order test loss: $L_{\mathcal{D}}(\mathbf{W}^{(t)})=\Theta(1)$.

Figure 1: Illustration of the phase transition between benign and harmful overfitting. The blue region represents the setting under which the overfitted CNN trained by gradient descent is guaranteed to have small population loss, and the yellow region represents the setting under which the population loss is guaranteed to be of constant order. The slim gray band region is the setting where the population loss is not well characterized.

Theorem 4.4 holds under the condition that $n^{-1}\cdot\mathrm{SNR}^{-q}=\widetilde{\Omega}(1)$. Clearly, this is the opposite regime (up to some logarithmic factors) compared with Theorem 4.3. In this case, the CNN trained by gradient descent mainly memorizes noises in the training data and does not learn enough signal. This, together with the results in Theorem 4.3, reveals a clear phase transition between signal learning and noise memorization in CNN training:

  • If $n\cdot\mathrm{SNR}^{q}=\widetilde{\Omega}(1)$, then the CNN learns the signal and achieves an $O(\epsilon+\exp(-n^{2}))$ test loss. This is the regime of benign overfitting.

  • If $n^{-1}\cdot\mathrm{SNR}^{-q}=\widetilde{\Omega}(1)$, then the CNN can only memorize noises and will have a $\Theta(1)$ test loss. This is the regime of harmful overfitting.

The phase transition is illustrated in Figure 1. Clearly, when learning a two-layer CNN on the data generated from Definition 3.1, $n\cdot\mathrm{SNR}^{q}=\widetilde{\Omega}(1)$ is the precise condition under which benign overfitting occurs. Remarkably, in this case the population loss decreases exponentially with the sample size $n$. Under our condition that $n=\Omega(\operatorname{polylog}(d))$, this term can also be upper bounded by $1/\mathrm{poly}(d)$, which is small in the high-dimensional setting. Note that when $\|\bm{\mu}\|_{2}=\Theta(1)$ and $\sigma_{p}=\Theta(d^{-1/2})$, applying standard uniform convergence based bounds (Bartlett et al., 2017; Neyshabur et al., 2018) or stability based bounds (Hardt et al., 2016; Mou et al., 2017; Chen et al., 2018) typically gives a $\widetilde{O}(n^{-1/2})$ bound on the generalization gap, which is vacuous when $n=O(\operatorname{polylog}(d))$. Our bound under the same setting is $O(1/\mathrm{poly}(d))$, which is non-vacuous. This is attributed to our precise analysis of signal learning and noise memorization in Theorems 4.3 and 4.4.
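Up to the logarithmic factors hidden in $\widetilde{\Omega}(\cdot)$, the two regimes are separated by the single quantity $n\cdot\mathrm{SNR}^{q}$. The snippet below evaluates it for a few illustrative configurations; the threshold of $1$ used here ignores constants and log factors and is only meant to indicate which side of the phase transition a configuration falls on.

    import numpy as np

    def snr_regime(n, mu_norm, sigma_p, d, q=3):
        """Classify a configuration by n * SNR^q with SNR = ||mu||_2 / (sigma_p * sqrt(d)).

        The constants and log factors hidden by the Omega~(.) conditions are ignored,
        so the threshold of 1 is only indicative.
        """
        key = n * (mu_norm / (sigma_p * np.sqrt(d))) ** q
        return key, ("benign overfitting" if key >= 1 else "harmful overfitting")

    d = 1000
    for n, mu_norm in [(20, 1.0), (20, 0.2), (500, 0.2)]:
        print(n, mu_norm, snr_regime(n, mu_norm, sigma_p=d ** -0.5, d=d))

Note how, with the signal strength held fixed, increasing the sample size $n$ can move a configuration from the harmful to the benign regime, consistent with the discussion above.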

Comparison with neural tangent kernel (NTK) results. We want to emphasize that our analysis is beyond the so-called neural tangent kernel regime. In the NTK regime, it has been shown that gradient descent can train an over-parameterized neural network to achieve good training and test accuracies (Jacot et al., 2018; Du et al., 2019b, a; Allen-Zhu et al., 2019b; Zou et al., 2019; Arora et al., 2019a; Cao and Gu, 2019a; Chen et al., 2019). However, it is widely believed in the literature that the NTK analyses cannot fully explain the success of deep learning, as the neural networks in the NTK regime are almost “linearized” (Lee et al., 2019; Cao and Gu, 2019a). Our analysis and results are not in the NTK regime: in the NTK regime, the network parameters stay close to their initialization throughout training, i.e., $\|\mathbf{W}^{(t)}-\mathbf{W}^{(0)}\|_{F}=O(1)$, so that the NN model can be approximated by its linearization (Allen-Zhu et al., 2019b; Cao and Gu, 2019a; Chen et al., 2019). In comparison, our analysis does not rely on linearizing the neural network function, and $\|\mathbf{W}^{(t)}-\mathbf{W}^{(0)}\|_{F}$ can be as large as $O(\mathrm{poly}(m))$.

5 Overview of Proof Technique

In this section, we discuss the main challenges in the study of CNN training under our setting, and explain some key techniques we implement in our proofs to overcome these challenges. The complete proofs of all the results are given in the appendix.

Main challenges. Studying benign overfitting under our setting is a challenging task. The first challenge is the nonconvexity of the training objective function $L_{S}(\mathbf{W})$. Nonconvexity introduces new difficulties in the study of benign overfitting, particularly because our goal is not only to show the convergence of the training loss, but also to study the population loss in the over-parameterized setting, which requires a precise algorithmic analysis of the learning problem.

5.1 Iterative Analysis of the Signal-Noise Decomposition

In order to study the learning process based on the nonconvex optimization problem, we propose a key technique which enables the iterative analysis of the coefficients in the signal-noise decomposition in Definition 4.1. This technique is given in the following lemma.

Lemma 5.1.

The coefficients $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},\underline{\rho}_{j,r,i}^{(t)}$ in Definition 4.1 satisfy the following equations:

\gamma_{j,r}^{(0)},\overline{\rho}_{j,r,i}^{(0)},\underline{\rho}_{j,r,i}^{(0)}=0, \qquad (5.1)

\gamma_{j,r}^{(t+1)}=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\cdot\|\bm{\mu}\|_{2}^{2}, \qquad (5.2)

\overline{\rho}_{j,r,i}^{(t+1)}=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=j), \qquad (5.3)

\underline{\rho}_{j,r,i}^{(t+1)}=\underline{\rho}_{j,r,i}^{(t)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=-j). \qquad (5.4)
Remark 5.2.

With the decomposition (4.1), the signal learning and noise memorization processes of a CNN can be formally studied by analyzing the dynamics of $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},\underline{\rho}_{j,r,i}^{(t)}$ based on the dynamical system (5.2)-(5.4). Note that prior to our work, several existing results have utilized the inner products $\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle$ during the neural network training process in order to establish generalization bounds (Brutzkus et al., 2018; Chatterji and Long, 2020; Frei et al., 2021). Similar inner product based arguments are also implemented in Allen-Zhu and Li (2020a, b); Zou et al. (2021a), which study different topics related to learning neural networks. Compared with the inner product based argument, our method has two major advantages: (i) Based on the definitions (5.2)-(5.4) and the fact that $\ell_{i}^{\prime(t)}<0$, it is clear that $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}$ are monotonically increasing, while $\underline{\rho}_{j,r,i}^{(t)}$ is monotonically decreasing throughout the whole training process. In comparison, monotonicity does not hold in the inner product based argument, especially for $\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle$. (ii) Our signal-noise decomposition also enables a clean homogeneity-based proof for the convergence of the training loss to an arbitrarily small error rate $\epsilon>0$, which will be presented in Subsection 5.2.

With Lemma 5.1, we can reduce the study of the CNN learning process to the analysis of the discrete dynamical system given by (5.1)-(5.4). Our proof then focuses on a careful assessment of the values of the coefficients $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},\underline{\rho}_{j,r,i}^{(t)}$ throughout training. To prepare for more detailed analyses, we first present the following bounds on the coefficients, which hold throughout training.
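To make this reduction concrete, the following sketch iterates the coefficient dynamics (5.1)-(5.4) on a small synthetic instance, tracking the signed coefficient $\rho_{j,r,i}^{(t)}$ (the sum of the split coefficients in (5.3) and (5.4)) and reconstructing $\mathbf{w}_{j,r}^{(t)}$ from the decomposition (4.1) at every step to evaluate the required inner products. It reuses sample_data, cnn_forward and drelu_q from the earlier sketches, and all parameter values are arbitrary illustrative choices rather than settings from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, m, q = 500, 10, 5, 3
    sigma_p, sigma_0, eta, T = 1.0 / np.sqrt(d), 0.05, 0.5, 2000
    mu = np.zeros(d); mu[0] = 1.0
    X, y = sample_data(n, mu, sigma_p, rng)
    x1, x2 = X[:, :d], X[:, d:]
    # The noise patch xi_i is the patch orthogonal to mu.
    Xi = np.where((np.abs(x1 @ mu) > 1e-12)[:, None], x2, x1)

    W0 = sigma_0 * rng.standard_normal((2, m, d))        # w_{j,r}^{(0)}
    gamma = np.zeros((2, m))                             # gamma_{j,r}^{(t)}
    rho = np.zeros((2, m, n))                            # signed rho_{j,r,i}^{(t)}
    js = np.array([+1.0, -1.0])

    for t in range(T):
        # Reconstruct w_{j,r}^{(t)} from the signal-noise decomposition (4.1).
        W = (W0
             + js[:, None, None] * gamma[:, :, None] * mu[None, None, :] / np.dot(mu, mu)
             + rho @ (Xi / (Xi ** 2).sum(axis=1, keepdims=True)))
        ell_prime = -1.0 / (1.0 + np.exp(y * cnn_forward(W, X, q)))   # l'_i^{(t)} < 0
        for jdx, j in enumerate([+1, -1]):
            # (5.2): signal coefficient update.
            sig_act = drelu_q((W[jdx] @ mu)[:, None] * y[None, :], q)  # sigma'(<w, y_i mu>)
            gamma[jdx] -= eta / (n * m) * (sig_act * ell_prime).sum(axis=1) * np.dot(mu, mu)
            # (5.3)/(5.4) combined: noise coefficient update, signed by j * y_i.
            noise_act = drelu_q(W[jdx] @ Xi.T, q)                      # sigma'(<w, xi_i>)
            rho[jdx] -= (eta / (n * m) * noise_act * ell_prime[None, :]
                         * (Xi ** 2).sum(axis=1)[None, :] * (j * y)[None, :])
    print("max gamma:", gamma.max(), "max |rho|:", np.abs(rho).max())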

Proposition 5.3.

Under Condition 4.2, for any $T^{*}=\eta^{-1}\mathrm{poly}(\epsilon^{-1},\|\bm{\mu}\|_{2}^{-1},d^{-1}\sigma_{p}^{-2},\sigma_{0}^{-1},n,m,d)$, the following bounds hold for $t\in[0,T^{*}]$:

  • $0\leq\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}\leq 4\log(T^{*})$ for all $j\in\{\pm 1\}$, $r\in[m]$ and $i\in[n]$.

  • $0\geq\underline{\rho}_{j,r,i}^{(t)}\geq-2\max_{i,j,r}\{|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\}-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\cdot 4\log(T^{*})$ for all $j\in\{\pm 1\}$, $r\in[m]$ and $i\in[n]$.

We can then prove the following lemma, which demonstrates that the training objective function $L_{S}(\mathbf{W})$ can dominate the gradient norm $\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}$ along the gradient descent path.

Lemma 5.4.

Under Condition 4.2, for any $T^{*}=\eta^{-1}\mathrm{poly}(\epsilon^{-1},\|\bm{\mu}\|_{2}^{-1},d^{-1}\sigma_{p}^{-2},\sigma_{0}^{-1},n,m,d)$, the following result holds for $t\in[0,T^{*}]$:

\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}=O\big(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\}\big)\cdot L_{S}(\mathbf{W}^{(t)}).

Lemma 5.4 plays a key role in the convergence proof for the training loss. However, note that our study of benign overfitting requires carefully monitoring the changes of the coefficients in the signal-noise decomposition, which cannot be done directly by Lemma 5.4. This is quite a challenging task, due to the complicated interactions among $\gamma_{j,r}^{(t)}$, $\overline{\rho}_{j,r,i}^{(t)}$ and $\underline{\rho}_{j,r,i}^{(t)}$. Note that even $\gamma_{j,r}^{(t)}$, which has the simplest formula (5.2), depends on all the quantities $\gamma_{j^{\prime},r^{\prime}}^{(t)}$, $\overline{\rho}_{j^{\prime},r^{\prime},i}^{(t)}$ and $\underline{\rho}_{j^{\prime},r^{\prime},i}^{(t)}$ for $j^{\prime}\in\{\pm 1\}$, $r^{\prime}\in[m]$ and $i\in[n]$. This is because the cross-entropy loss derivative term $\ell_{i}^{\prime(t)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]$ depends on all the neurons of the network. To overcome this challenge, we introduce in the next subsection a decoupling technique based on a two-stage analysis.

5.2 Decoupling with a Two-Stage Analysis.

We utilize a two-stage analysis to decouple the complicated relations among the coefficients $\gamma_{j,r}^{(t)}$, $\overline{\rho}_{j,r,i}^{(t)}$ and $\underline{\rho}_{j,r,i}^{(t)}$. Intuitively, the initial neural network weights are small enough so that the neural network at initialization has constant level cross-entropy loss derivatives on all the training data: $\ell_{i}^{\prime(0)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(0)},\mathbf{x}_{i})]=\Theta(1)$ for all $i\in[n]$. This is guaranteed under Condition 4.2 and matches neural network training in practice. Motivated by this, we consider a first stage of the training process in which $\ell_{i}^{\prime(t)}=\Theta(1)$, where we can show significant scale differences among $\gamma_{j,r}^{(t)}$, $\overline{\rho}_{j,r,i}^{(t)}$ and $\underline{\rho}_{j,r,i}^{(t)}$. Based on the results of the first stage, we then proceed to the second stage of the training process, where the loss derivatives are no longer at a constant level, and show that the training loss can be optimized to be arbitrarily small while the scale differences established in the first stage remain unchanged throughout training. In the following, we focus on explaining the key proof steps for Theorem 4.3. The proof idea for Theorem 4.4 is similar, so we defer the details to the appendix.

Stage 1. It can be shown that, until some of the coefficients $\gamma_{j,r}^{(t)}$, $\rho_{j,r,i}^{(t)}$ reach $\Theta(1)$, we have $\ell_{i}^{\prime(t)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]=\Theta(1)$ for all $i\in[n]$. Therefore, we first focus on this stage of the training process, where the dynamics of the coefficients in (5.2)-(5.4) can be greatly simplified by replacing the $\ell_{i}^{\prime(t)}$ factors with their constant upper and lower bounds. The following lemma summarizes our main conclusion for stage 1 of signal learning:

Lemma 5.5.

Under the same conditions as Theorem 4.3, there exists $T_{1}=\widetilde{O}(\eta^{-1}m\sigma_{0}^{2-q}\|\bm{\mu}\|_{2}^{-q})$ such that

  • $\max_{r}\gamma_{j,r}^{(T_{1})}=\Omega(1)$ for $j\in\{\pm 1\}$.

  • $|\rho_{j,r,i}^{(t)}|=O(\sigma_{0}\sigma_{p}\sqrt{d})$ for all $j\in\{\pm 1\}$, $r\in[m]$, $i\in[n]$ and $0\leq t\leq T_{1}$.

Lemma 5.5 takes advantage of the training period during which the loss derivatives remain of constant order to show that the CNN can capture the signal. At the end of stage 1 of signal learning, $\max_{r}\gamma_{j,r}$ reaches $\Theta(1)$ and is significantly larger than $\rho_{j,r,i}^{(t)}$. After this point, it is no longer guaranteed that the loss derivatives $\ell_{i}^{\prime(t)}$ remain of constant order, and training stage 2 starts.

Stage 2. In this stage, we take into full consideration the exact definition $\ell_{i}^{\prime(t)}=\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]$ and show that the training loss converges below any target accuracy, i.e., $L_{S}(\mathbf{W}^{(t)})<\epsilon$. Thanks to the analysis in stage 1, we know that some $\gamma_{j,r}^{(t)}$ is significantly larger than all the $\rho_{j,r,i}^{(t)}$’s at the end of stage 1. This scale difference is the key to our analysis in stage 2. Based on this scale difference and the monotonicity of $\gamma_{j,r}^{(t)}$, $\overline{\rho}_{j,r,i}^{(t)}$, $\underline{\rho}_{j,r,i}^{(t)}$ in the signal-noise decomposition, it can be shown that there exists $\mathbf{W}^{*}$ such that $y_{i}\cdot\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle\geq q\log(2q/\epsilon)$ throughout stage 2. Moreover, since the neural network $f(\mathbf{W},\mathbf{x})$ is $q$-homogeneous in $\mathbf{W}$, we have $\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}),\mathbf{W}^{(t)}\rangle=q\cdot f(\mathbf{W}^{(t)},\mathbf{x})$. Therefore,

\begin{align*}
\langle\nabla L_{S}(\mathbf{W}^{(t)}),\mathbf{W}^{(t)}-\mathbf{W}^{*}\rangle&=\frac{1}{n}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot y_{i}\cdot\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{(t)}-\mathbf{W}^{*}\rangle\\
&=\frac{1}{n}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot[y_{i}\cdot q\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})-y_{i}\cdot\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle]\\
&\geq\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})]\cdot[y_{i}\cdot q\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i})-q\log(2q/\epsilon)]\\
&\geq q\cdot\frac{1}{n}\sum_{i=1}^{n}[\ell(y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}_{i}))-\ell(\log(2q/\epsilon))]\\
&\geq q\cdot L_{S}(\mathbf{W}^{(t)})-\epsilon/2,
\end{align*}

where the second inequality follows by the convexity of the cross-entropy loss function. With the above key technique, we can prove the following lemma.
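The $q$-homogeneity identity $\langle\nabla f(\mathbf{W},\mathbf{x}),\mathbf{W}\rangle=q\cdot f(\mathbf{W},\mathbf{x})$ used in the derivation above can be sanity-checked numerically with a finite-difference gradient. The short script below (our own test harness with arbitrary sizes, reusing cnn_forward from the sketch in Section 3) does exactly this.

    import numpy as np

    def numerical_grad(f, W, eps=1e-6):
        """Central finite-difference gradient of a scalar function f at W."""
        grad = np.zeros_like(W)
        it = np.nditer(W, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            Wp, Wm = W.copy(), W.copy()
            Wp[idx] += eps
            Wm[idx] -= eps
            grad[idx] = (f(Wp) - f(Wm)) / (2 * eps)
        return grad

    rng = np.random.default_rng(1)
    d, m, q = 8, 3, 3
    x = rng.standard_normal(2 * d)
    W = 0.5 * rng.standard_normal((2, m, d))
    f_val = cnn_forward(W, x[None, :], q)[0]
    grad = numerical_grad(lambda V: cnn_forward(V, x[None, :], q)[0], W)
    # Euler's identity for q-homogeneous functions: <grad f, W> = q * f.
    print(np.isclose((grad * W).sum(), q * f_val, rtol=1e-3, atol=1e-6))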

Lemma 5.6.

Let $T,T_{1}$ be defined in Theorem 4.3 and Lemma 5.5 respectively. Then under the same conditions as Theorem 4.3, for any $t\in[T_{1},T]$, it holds that $|\rho_{j,r,i}^{(t)}|\leq\sigma_{0}\sigma_{p}\sqrt{d}$ for all $j\in\{\pm 1\}$, $r\in[m]$ and $i\in[n]$. Moreover, let $\mathbf{W}^{*}$ be the collection of CNN parameters with convolution filters $\mathbf{w}^{*}_{j,r}=\mathbf{w}_{j,r}^{(0)}+2qm\log(2q/\epsilon)\cdot j\cdot\|\bm{\mu}\|_{2}^{-2}\cdot\bm{\mu}$. Then the following bound holds:

\frac{1}{t-T_{1}+1}\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(t-T_{1}+1)}+\frac{\epsilon}{2q-1}

for all $t\in[T_{1},T]$, where we denote $\|\mathbf{W}\|_{F}=\sqrt{\|\mathbf{W}_{+1}\|_{F}^{2}+\|\mathbf{W}_{-1}\|_{F}^{2}}$.

Lemma 5.6 states two main results on signal learning. First, during this training period, it is guaranteed that the coefficients of the noise vectors, $\rho_{j,r,i}^{(t)}$, in the signal-noise decomposition remain sufficiently small. Moreover, it also gives an optimization-type result: the average training loss over $[T_{1},T]$, and hence the training loss of the best iterate in $[T_{1},T]$, is small as long as $T$ is large enough.

Clearly, the convergence of the training loss stated in Theorem 4.3 directly follows by choosing $T$ to be sufficiently large in Lemma 5.6. The lemma below further gives an upper bound on the test loss.

Lemma 5.7.

Let $T$ be defined in Theorem 4.3. Under the same conditions as Theorem 4.3, for any $t\leq T$ with $L_{S}(\mathbf{W}^{(t)})\leq 1$, it holds that $L_{\mathcal{D}}(\mathbf{W}^{(t)})\leq 6\cdot L_{S}(\mathbf{W}^{(t)})+\exp(-n^{2})$.

Below we finalize the proof of Theorem 4.3. The proofs of other results are in the appendix.

Proof of Theorem 4.3.

The first part of Theorem 4.3 follows by Lemma 5.5 and the monotonicity of $\gamma_{j,r}^{(t)}$. The second part of Theorem 4.3 follows by Lemma 5.6. For the third part, let $\mathbf{W}^{*}$ be defined in Lemma 5.6. Then by the definition of $\mathbf{W}^{*}$, we have

\begin{align*}
\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}&\leq\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{(0)}\|_{F}+\|\mathbf{W}^{(0)}-\mathbf{W}^{*}\|_{F}\\
&\leq\sum_{j,r}\gamma_{j,r}^{(T_{1})}\|\bm{\mu}\|_{2}^{-1}+\sum_{j,r,i}\frac{\overline{\rho}_{j,r,i}^{(T_{1})}}{\|\bm{\xi}_{i}\|_{2}}+\sum_{j,r,i}\frac{\underline{\rho}_{j,r,i}^{(T_{1})}}{\|\bm{\xi}_{i}\|_{2}}+\Theta(m^{3/2}\log(1/\epsilon))\|\bm{\mu}\|_{2}^{-1}\\
&=\widetilde{O}(m^{3/2}\|\bm{\mu}\|_{2}^{-1}),
\end{align*}

where the first inequality is by the triangle inequality, the second inequality is by the signal-noise decomposition of $\mathbf{W}^{(T_{1})}$ and the definition of $\mathbf{W}^{*}$, and the last equality is by Proposition 5.3 and Lemma 5.5. Therefore, choosing $T=\widetilde{\Theta}(\eta^{-1}T_{1}+\eta^{-1}\epsilon^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2})=\widetilde{\Theta}(\eta^{-1}\sigma_{0}^{-(q-2)}\|\bm{\mu}\|_{2}^{-q}+\eta^{-1}\epsilon^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2})$ in Lemma 5.6 ensures that

\frac{1}{T-T_{1}+1}\sum_{t=T_{1}}^{T}L_{S}(\mathbf{W}^{(t)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(T-T_{1}+1)}+\frac{\epsilon}{2q-1}\leq\frac{\widetilde{O}(m^{3}\|\bm{\mu}\|_{2}^{-2})}{(2q-1)\eta(T-T_{1}+1)}+\frac{\epsilon}{2q-1}\leq\epsilon,

and there exists $t\in[T_{1},T]$ such that $L_{S}(\mathbf{W}^{(t)})\leq\epsilon$. This completes the proof of the third part of Theorem 4.3. Finally, combining this bound with Lemma 5.7 gives

L_{\mathcal{D}}(\mathbf{W}^{(t)})\leq 6\cdot L_{S}(\mathbf{W}^{(t)})+\exp(-n^{2})\leq 6\epsilon+\exp(-n^{2}),

which proves the last part of Theorem 4.3. ∎

6 Conclusion and Future Work

This paper utilizes a signal-noise decomposition to study the signal learning and noise memorization processes in the training of a two-layer CNN. We precisely give the conditions under which the CNN will mainly focus on learning signals or on memorizing noises, and reveal a phase transition of the population loss with respect to the sample size, signal strength, noise level, and dimension. Our results theoretically demonstrate that benign overfitting can happen in neural network training. An important future direction is to study the benign overfitting phenomenon of neural networks in learning other data models. Moreover, it is also important to generalize our analysis to deep convolutional neural networks.

Acknowledgements

We would like to thank Spencer Frei for valuable comments and discussions on an earlier version of this paper, and for pointing out a related work.

Appendix A Additional Related Work

There have also been a large number of works studying the optimization and generalization of neural networks. A series of works (Li and Yuan, 2017; Soltanolkotabi, 2017; Du et al., 2018a, b; Zhong et al., 2017; Zhang et al., 2019; Cao and Gu, 2019b) studied the parameter recovery problem in two-layer neural networks, where the data are given by a teacher network and the task is to recover the parameters of the teacher network. These works either focus on the noiseless setting, or require the number of training data points to be larger than the number of parameters in the network, and therefore do not cover the setting where the neural network can overfit the training data. Another line of works (Neyshabur et al., 2015; Bartlett et al., 2017; Neyshabur et al., 2018; Golowich et al., 2018; Arora et al., 2018) studied the generalization gap between the training and test losses of neural networks with uniform convergence based arguments. However, these results are not algorithm-dependent and cannot explain benign overfitting. Some recent works studied the generalization gap based on stability arguments (Bousquet and Elisseeff, 2002; Hardt et al., 2016; Mou et al., 2017; Chen et al., 2018). A more recent line of works studied the convergence (Jacot et al., 2018; Li and Liang, 2018; Du et al., 2019b; Allen-Zhu et al., 2019b; Du et al., 2019a; Zou et al., 2019) and test error bounds (Allen-Zhu et al., 2019a; Arora et al., 2019a, b; Cao and Gu, 2019a; Ji and Telgarsky, 2020; Chen et al., 2019) of over-parameterized networks in the neural tangent kernel regime. However, these works depend on the equivalence between neural network training and kernel methods, which cannot fully explain the success of deep learning. Compared with the works mentioned above, our work has a different focus, which is to study the conditions for benign and harmful overfitting.

Appendix B Preliminary Lemmas

In this section, we present some pivotal lemmas that give some important properties of the data and the neural network parameters at their random initialization.

Lemma B.1.

Suppose that $\delta>0$ and $n\geq 8\log(4/\delta)$. Then with probability at least $1-\delta$,

|\{i\in[n]:y_{i}=1\}|,~|\{i\in[n]:y_{i}=-1\}|\geq n/4.
Proof of Lemma B.1.

By Hoeffding’s inequality, with probability at least $1-\delta/2$, we have

\bigg|\frac{1}{n}\sum_{i=1}^{n}\operatorname{\mathds{1}}\{y_{i}=1\}-\frac{1}{2}\bigg|\leq\sqrt{\frac{\log(4/\delta)}{2n}}.

Therefore, as long as $n\geq 8\log(4/\delta)$, we have

|\{i\in[n]:y_{i}=1\}|=\sum_{i=1}^{n}\operatorname{\mathds{1}}\{y_{i}=1\}\geq\frac{n}{2}-n\cdot\sqrt{\frac{\log(4/\delta)}{2n}}\geq\frac{n}{4}.

This proves the result for $|\{i\in[n]:y_{i}=1\}|$. The proof for $|\{i\in[n]:y_{i}=-1\}|$ is exactly the same, and we can conclude the proof by applying a union bound. ∎

The following lemma estimates the norms of the noise vectors $\bm{\xi}_{i}$, $i\in[n]$, and gives an upper bound of their inner products with each other.

Lemma B.2.

Suppose that $\delta>0$ and $d=\Omega(\log(4n/\delta))$. Then with probability at least $1-\delta$,

\sigma_{p}^{2}d/2\leq\|\bm{\xi}_{i}\|_{2}^{2}\leq 3\sigma_{p}^{2}d/2,
|\langle\bm{\xi}_{i},\bm{\xi}_{i^{\prime}}\rangle|\leq 2\sigma_{p}^{2}\cdot\sqrt{d\log(4n^{2}/\delta)}

for all $i,i^{\prime}\in[n]$ with $i\neq i^{\prime}$ in the second bound.

Proof of Lemma B.2.

By Bernstein’s inequality, with probability at least $1-\delta/(2n)$ we have

\big|\|\bm{\xi}_{i}\|_{2}^{2}-\sigma_{p}^{2}d\big|=O(\sigma_{p}^{2}\cdot\sqrt{d\log(4n/\delta)}).

Therefore, as long as $d=\Omega(\log(4n/\delta))$, we have

\sigma_{p}^{2}d/2\leq\|\bm{\xi}_{i}\|_{2}^{2}\leq 3\sigma_{p}^{2}d/2.

Moreover, $\langle\bm{\xi}_{i},\bm{\xi}_{i^{\prime}}\rangle$ clearly has mean zero. For any $i,i^{\prime}$ with $i\neq i^{\prime}$, by Bernstein’s inequality, with probability at least $1-\delta/(2n^{2})$ we have

|\langle\bm{\xi}_{i},\bm{\xi}_{i^{\prime}}\rangle|\leq 2\sigma_{p}^{2}\cdot\sqrt{d\log(4n^{2}/\delta)}.

Applying a union bound completes the proof. ∎
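Both bounds in Lemma B.2 are easy to check empirically. The following short Monte Carlo (our own, with arbitrary choices of $d$, $n$, $\sigma_p$ and $\delta$) draws noise vectors as in Definition 3.1 and verifies that the observed norms and pairwise inner products fall inside the stated ranges; both printed values should be True.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, sigma_p, delta = 2000, 50, 0.3, 0.01
    mu = np.zeros(d); mu[0] = 1.0
    mu_unit = mu / np.linalg.norm(mu)
    # Noise with covariance sigma_p^2 * (I - mu mu^T / ||mu||_2^2).
    Xi = sigma_p * rng.standard_normal((n, d))
    Xi -= np.outer(Xi @ mu_unit, mu_unit)

    norms_sq = (Xi ** 2).sum(axis=1)
    print(norms_sq.min() >= sigma_p ** 2 * d / 2 and norms_sq.max() <= 1.5 * sigma_p ** 2 * d)

    G = Xi @ Xi.T
    off_diag = np.abs(G[~np.eye(n, dtype=bool)])
    print(off_diag.max() <= 2 * sigma_p ** 2 * np.sqrt(d * np.log(4 * n ** 2 / delta)))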

The following lemma studies the inner product between a randomly initialized CNN convolutional filter $\mathbf{w}_{j,r}^{(0)}$, $j\in\{+1,-1\}$ and $r\in[m]$, and the signal/noise vectors in the training data. The calculations characterize how the neural network at initialization randomly captures signal and noise information.

Lemma B.3.

Suppose that $d\geq\Omega(\log(mn/\delta))$ and $m=\Omega(\log(1/\delta))$. Then with probability at least $1-\delta$,

|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|\leq\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2},
|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\leq 2\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}

for all $r\in[m]$, $j\in\{\pm 1\}$ and $i\in[n]$. Moreover,

\sigma_{0}\|\bm{\mu}\|_{2}/2\leq\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\leq\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2},
\sigma_{0}\sigma_{p}\sqrt{d}/4\leq\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle\leq 2\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}

for all $j\in\{\pm 1\}$ and $i\in[n]$.

Proof of Lemma B.3.

It is clear that for each $r\in[m]$, $j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle$ is a Gaussian random variable with mean zero and variance $\sigma_{0}^{2}\|\bm{\mu}\|_{2}^{2}$. Therefore, by the Gaussian tail bound and a union bound, with probability at least $1-\delta/4$,

j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\leq|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|\leq\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2}.

Moreover, $\mathbb{P}(\sigma_{0}\|\bm{\mu}\|_{2}/2>j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle)$ is an absolute constant, and therefore by the condition on $m$, we have

\begin{align*}
\mathbb{P}\big(\sigma_{0}\|\bm{\mu}\|_{2}/2\leq\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\big)&=1-\mathbb{P}\big(\sigma_{0}\|\bm{\mu}\|_{2}/2>\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\big)\\
&=1-\mathbb{P}\big(\sigma_{0}\|\bm{\mu}\|_{2}/2>j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle\big)^{2m}\\
&\geq 1-\delta/4.
\end{align*}

By Lemma B.2, with probability at least $1-\delta/4$, $\sigma_{p}\sqrt{d}/\sqrt{2}\leq\|\bm{\xi}_{i}\|_{2}\leq\sqrt{3/2}\cdot\sigma_{p}\sqrt{d}$ for all $i\in[n]$. Therefore, the result for $\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle$ follows from the same proof as that for $j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle$. ∎
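As with Lemma B.2, the initialization bounds can be verified by simulation. The sketch below (our own, with arbitrary sizes) checks that $\max_{r\in[m]}j\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle$ falls in the stated range for a Gaussian initialization; both printed values should be True.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m, sigma_0, delta = 2000, 64, 0.01, 0.01
    mu = np.zeros(d); mu[0] = 1.0                     # ||mu||_2 = 1
    mu_norm = np.linalg.norm(mu)

    W0 = sigma_0 * rng.standard_normal((2, m, d))     # filters w_{j,r}^{(0)}
    for j_idx, j in enumerate([+1, -1]):
        max_inner = np.max(j * (W0[j_idx] @ mu))
        lower = sigma_0 * mu_norm / 2
        upper = np.sqrt(2 * np.log(8 * m / delta)) * sigma_0 * mu_norm
        print(j, lower <= max_inner <= upper)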

Appendix C Signal-noise Decomposition Analysis

In this section, we establish a series of results on the signal-noise decomposition. These results are based on the conclusions in Section B, which hold with high probability. Denote by $\mathcal{E}_{\mathrm{prelim}}$ the event that all the results in Section B hold. Then for simplicity and clarity, we state all the results in this and the following sections conditional on $\mathcal{E}_{\mathrm{prelim}}$.

Lemma C.1 (Restatement of Lemma 5.1).

The coefficients $\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},\underline{\rho}_{j,r,i}^{(t)}$ defined in Definition 4.1 satisfy the following iterative equations:

\gamma_{j,r}^{(0)},\overline{\rho}_{j,r,i}^{(0)},\underline{\rho}_{j,r,i}^{(0)}=0,
\gamma_{j,r}^{(t+1)}=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\cdot\|\bm{\mu}\|_{2}^{2},
\overline{\rho}_{j,r,i}^{(t+1)}=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=j),
\underline{\rho}_{j,r,i}^{(t+1)}=\underline{\rho}_{j,r,i}^{(t)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=-j)

for all $r\in[m]$, $j\in\{\pm 1\}$ and $i\in[n]$.

Proof of Lemma C.1.

By our data model in Definition 3.1 and the Gaussian initialization of the CNN weights, it is clear that with probability $1$, the vectors $\bm{\mu},\bm{\xi}_{1},\ldots,\bm{\xi}_{n}$ are linearly independent. Therefore, the decomposition (4.1) is unique. Now consider $\widetilde{\gamma}_{j,r}^{(0)},\widetilde{\rho}_{j,r,i}^{(0)}=0$ and

\widetilde{\gamma}_{j,r}^{(t+1)}=\widetilde{\gamma}_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\cdot\|\bm{\mu}\|_{2}^{2},
\widetilde{\rho}_{j,r,i}^{(t+1)}=\widetilde{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot jy_{i}.

It is then easy to check by (3.1) that

\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\widetilde{\gamma}_{j,r}^{(t)}\cdot\|\bm{\mu}\|_{2}^{-2}\cdot\bm{\mu}+\sum_{i=1}^{n}\widetilde{\rho}_{j,r,i}^{(t)}\cdot\|\bm{\xi}_{i}\|_{2}^{-2}\cdot\bm{\xi}_{i}.

Hence by the uniqueness of the decomposition we have $\gamma_{j,r}^{(t)}=\widetilde{\gamma}_{j,r}^{(t)}$ and $\rho_{j,r,i}^{(t)}=\widetilde{\rho}_{j,r,i}^{(t)}$. Then we have that

\rho_{j,r,i}^{(t)}=-\sum_{s=0}^{t-1}\frac{\eta}{nm}\cdot\ell_{i}^{\prime(s)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot jy_{i}.

Moreover, note that i(t)<0\ell_{i}^{\prime(t)}<0 by the definition of the cross-entropy loss. Therefore,

ρ¯j,r,i(t)=s=0t1ηnmi(s)σ(𝐰j,r(s),𝝃i)𝝃i22𝟙(yi=j),\displaystyle\overline{\rho}_{j,r,i}^{(t)}=-\sum_{s=0}^{t-1}\frac{\eta}{nm}\cdot\ell_{i}^{\prime(s)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=j), (C.1)
ρ¯j,r,i(t)=s=0t1ηnmi(s)σ(𝐰j,r(s),𝝃i)𝝃i22𝟙(yi=j).\displaystyle\underline{\rho}_{j,r,i}^{(t)}=-\sum_{s=0}^{t-1}\frac{\eta}{nm}\cdot\ell_{i}^{\prime(s)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\rangle)\cdot\|\bm{\xi}_{i}\|_{2}^{2}\cdot\operatorname{\mathds{1}}(y_{i}=-j). (C.2)

Writing out the iterative versions of (C.1) and (C.2) completes the proof. ∎
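For concreteness, the following self-contained numerical sketch (not part of the proof; the sizes n, m, d, q and the hyperparameters \sigma_{0}, \sigma_{p}, \eta are illustrative assumptions) runs gradient descent on the two-layer CNN and, in parallel, the coefficient recursions of Lemma C.1, and then verifies that the signal-noise decomposition (4.1) reconstructs the trained weights exactly. It tracks the combined coefficient \rho_{j,r,i}, which equals \overline{\rho}_{j,r,i} when y_{i}=j and \underline{\rho}_{j,r,i} when y_{i}=-j.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instantiation of the data model in Definition 3.1 (illustrative sizes,
# not the regime required by Condition 4.2).
n, m, d, q = 8, 5, 50, 3
sigma0, sigma_p, eta, steps = 0.01, 1.0, 0.05, 100

mu = np.zeros(d); mu[0] = 2.0                       # signal vector
y = rng.choice([-1, 1], size=n)                     # labels
xi = sigma_p * rng.standard_normal((n, d))
xi[:, 0] = 0.0                                      # noise orthogonal to mu

act  = lambda z: np.maximum(z, 0.0) ** q            # sigma(z) = max(z, 0)^q
dact = lambda z: q * np.maximum(z, 0.0) ** (q - 1)  # sigma'(z)

W0 = {j: sigma0 * rng.standard_normal((m, d)) for j in (+1, -1)}
W  = {j: W0[j].copy() for j in (+1, -1)}
gamma = {j: np.zeros(m) for j in (+1, -1)}          # gamma_{j,r}
rho   = {j: np.zeros((m, n)) for j in (+1, -1)}     # combined rho_{j,r,i}

def f(W, i):
    # f(W, x_i) = (1/m) sum_{j,r} j [sigma(<w_{j,r}, y_i mu>) + sigma(<w_{j,r}, xi_i>)]
    return sum(j * (act(W[j] @ (y[i] * mu)).sum() + act(W[j] @ xi[i]).sum())
               for j in (+1, -1)) / m

for t in range(steps):
    lp = np.array([-1.0 / (1.0 + np.exp(y[i] * f(W, i))) for i in range(n)])  # ell'^{(t)}_i < 0
    newW = {j: W[j].copy() for j in (+1, -1)}
    for j in (+1, -1):
        for r in range(m):
            g_mu = sum(lp[i] * dact(W[j][r] @ (y[i] * mu)) for i in range(n))
            gamma[j][r] -= eta / (n * m) * g_mu * (mu @ mu)          # gamma update in Lemma C.1
            newW[j][r]  -= eta / (n * m) * j * g_mu * mu             # signal part of the GD step
            for i in range(n):
                g_xi = lp[i] * dact(W[j][r] @ xi[i])
                rho[j][r, i] -= eta / (n * m) * g_xi * (xi[i] @ xi[i]) * j * y[i]
                newW[j][r]   -= eta / (n * m) * g_xi * j * y[i] * xi[i]  # noise part of the GD step
    W = newW

# Check (4.1): w = w^(0) + j*gamma*mu/||mu||_2^2 + sum_i rho_i * xi_i/||xi_i||_2^2.
xi_unit = xi / (xi ** 2).sum(axis=1, keepdims=True)
for j in (+1, -1):
    W_rec = W0[j] + j * np.outer(gamma[j], mu) / (mu @ mu) + rho[j] @ xi_unit
    print(j, np.abs(W_rec - W[j]).max())            # machine precision: the decomposition tracks GD exactly
```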

We can further plug the signal-noise decomposition (4.1) into the iterative formulas in Lemma C.1. By the second equation in Lemma C.1, we have

γj,r(t+1)\displaystyle\gamma_{j,r}^{(t+1)} =γj,r(t)ηnmi=1ni(t)σ(yi𝐰j,r(0),𝝁+yijγj,r(t))𝝁22,\displaystyle=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(y_{i}\cdot\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle+y_{i}\cdot j\cdot\gamma_{j,r}^{(t)})\cdot\|\bm{\mu}\|_{2}^{2}, (C.3)

Moreover, by the third equation in Lemma C.1, we have

ρ¯j,r,i(t+1)\displaystyle\overline{\rho}_{j,r,i}^{(t+1)} =ρ¯j,r,i(t)ηnmi(t)σ(𝐰j,r(0),𝝃i+i=1nρ¯j,r,i(t)𝝃i,𝝃i𝝃i22+i=1nρ¯j,r,i(t)𝝃i,𝝃i𝝃i22)𝝃i22\displaystyle=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2} (C.4)

if j=yij=y_{i}, and ρ¯j,r,i(t)=0\overline{\rho}_{j,r,i}^{(t)}=0 for all t0t\geq 0 if j=yij=-y_{i}. Similarly, by the last equation in Lemma C.1, we have

ρ¯j,r,i(t+1)\displaystyle\underline{\rho}_{j,r,i}^{(t+1)} =ρ¯j,r,i(t)+ηnmi(t)σ(𝐰j,r(0),𝝃i+i=1nρ¯j,r,i(t)𝝃i,𝝃i𝝃i22+i=1nρ¯j,r,i(t)𝝃i,𝝃i𝝃i22)𝝃i22\displaystyle=\underline{\rho}_{j,r,i}^{(t)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2} (C.5)

if j=yij=-y_{i}, and ρ¯j,r,i(t)=0\underline{\rho}_{j,r,i}^{(t)}=0 for all t0t\geq 0 if j=yij=y_{i}.

We will now show that the coefficients in the signal-noise decomposition remain on a reasonable scale throughout a long period of training. Let us consider the learning period 0\leq t\leq T^{*}, where T^{*}=\eta^{-1}\text{poly}(\epsilon^{-1},\|\bm{\mu}\|_{2}^{-1},d^{-1}\sigma_{p}^{-2},\sigma_{0}^{-1},n,m,d) is the maximum number of admissible iterations. Note that we can consider any polynomial training time T^{*}. Denote \alpha=4\log(T^{*}). Here we list the exact conditions on \eta,\sigma_{0},d required by the proofs in this section, which are part of Condition 4.2:

η=O(min{nm/(qσp2d),nm/(q2q+2αq2σp2d),nm/(q2q+2αq2𝝁22)}),\displaystyle\eta=O\Big{(}\min\{nm/(q\sigma_{p}^{2}d),nm/(q2^{q+2}\alpha^{q-2}\sigma_{p}^{2}d),nm/(q2^{q+2}\alpha^{q-2}\|\bm{\mu}\|_{2}^{2})\}\Big{)}, (C.6)
σ0[16log(8mn/δ)]1min{𝝁21,(σpd)1},\displaystyle\sigma_{0}\leq[16\sqrt{\log(8mn/\delta)}]^{-1}\min\{\|\bm{\mu}\|_{2}^{-1},(\sigma_{p}\sqrt{d})^{-1}\}, (C.7)
d1024log(4n2/δ)α2n2.\displaystyle d\geq 1024\log(4n^{2}/\delta)\alpha^{2}n^{2}. (C.8)

Denote β=2maxi,j,r{|𝐰j,r(0),𝝁|,|𝐰j,r(0),𝝃i|}\beta=2\max_{i,j,r}\{|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\}. By Lemma B.3, with probability at least 1δ1-\delta, we can upper bound β\beta by 4log(8mn/δ)σ0max{𝝁2,σpd}4\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\cdot\max\{\|\bm{\mu}\|_{2},\sigma_{p}\sqrt{d}\}. Then, by (C.7) and (C.8), it is straightforward to verify the following inequality:

4max{β,8nlog(4n2/δ)dα}1.\displaystyle 4\max\bigg{\{}\beta,8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{\}}\leq 1. (C.9)
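As a quick numerical sanity check of this implication (the values of \delta, m, n, T^{*}, \|\bm{\mu}\|_{2} and \sigma_{p} below are illustrative assumptions; d and \sigma_{0} are chosen to satisfy (C.8) and (C.7) with some slack), one can plug in numbers and confirm (C.9):

```python
import numpy as np

# Illustrative values; delta, m, n, T*, ||mu||_2, sigma_p are assumptions for this check.
delta, m, n = 0.01, 10, 20
T_star = 10 ** 6
alpha = 4 * np.log(T_star)                                         # alpha = 4 log(T*)

d = 2 * 1024 * np.log(4 * n ** 2 / delta) * alpha ** 2 * n ** 2    # twice the bound in (C.8)
mu_norm, sigma_p = 5.0, 1.0
sigma0 = 0.5 * min(1 / mu_norm, 1 / (sigma_p * np.sqrt(d))) \
         / (16 * np.sqrt(np.log(8 * m * n / delta)))               # half the bound in (C.7)

# Upper bound on beta from Lemma B.3, then the left-hand side of (C.9).
beta = 4 * np.sqrt(np.log(8 * m * n / delta)) * sigma0 * max(mu_norm, sigma_p * np.sqrt(d))
lhs = 4 * max(beta, 8 * n * np.sqrt(np.log(4 * n ** 2 / delta) / d) * alpha)
print(lhs)        # about 0.7 here, so (C.9) holds
assert lhs <= 1
```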

Suppose that the conditions listed in (C.6), (C.7) and (C.8) hold. We claim that the following property holds for all 0\leq t\leq T^{*}.

Proposition C.2 (Restatement of Proposition 5.3).

Under Condition 4.2, for 0tT0\leq t\leq T^{*}, we have that

0γj,r(t),ρ¯j,r,i(t)α,\displaystyle 0\leq\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}\leq\alpha, (C.10)
0ρ¯j,r,i(t)β16nlog(4n2/δ)dαα.\displaystyle 0\geq\underline{\rho}_{j,r,i}^{(t)}\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\geq-\alpha. (C.11)

for all r[m]r\in[m], j{±1}j\in\{\pm 1\} and i[n]i\in[n].

We will use induction to prove Proposition C.2. We first introduce several technical lemmas that will be used for the proof of Proposition C.2.

Lemma C.3.

For any t0t\geq 0, it holds that 𝐰j,r(t)𝐰j,r(0),𝛍=jγj,r(t)\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle=j\cdot\gamma_{j,r}^{(t)} for all r[m]r\in[m], j{±1}j\in\{\pm 1\}.

Proof of Lemma C.3.

For any time t0t\geq 0, we have that

𝐰j,r(t)𝐰j,r(0),𝝁\displaystyle\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle =jγj,r(t)+i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝁+i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝁\displaystyle=j\cdot\gamma_{j,r}^{(t)}+\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\mu}\rangle+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\mu}\rangle
=jγj,r(t),\displaystyle=j\cdot\gamma_{j,r}^{(t)},

where the second equality holds because the noise vectors \bm{\xi}_{i^{\prime}} are orthogonal to \bm{\mu} under our data model. ∎

Lemma C.4.

Under Condition 4.2, suppose (C.10) and (C.11) hold at iteration tt. Then

ρ¯j,r,i(t)8nlog(4n2/δ)dα𝐰j,r(t)𝐰j,r(0),𝝃i\displaystyle\underline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle ρ¯j,r,i(t)+8nlog(4n2/δ)dα,jyi,\displaystyle\leq\underline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,~{}j\not=y_{i},
ρ¯j,r,i(t)8nlog(4n2/δ)dα𝐰j,r(t)𝐰j,r(0),𝝃i\displaystyle\overline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle ρ¯j,r,i(t)+8nlog(4n2/δ)dα,j=yi\displaystyle\leq\overline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,~{}j=y_{i}

for all r[m]r\in[m], j{±1}j\in\{\pm 1\} and i[n]i\in[n].

Proof of Lemma C.4.

For jyij\not=y_{i}, we have that ρ¯j,r,i(t)=0\overline{\rho}_{j,r,i}^{(t)}=0 and

𝐰j,r(t)𝐰j,r(0),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle =i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝃i+i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝃i\displaystyle=\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle
4log(4n2/δ)dii|ρ¯j,r,i(t)|+4log(4n2/δ)dii|ρ¯j,r,i(t)|+ρ¯j,r,i(t)\displaystyle\leq 4\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\sum_{i^{\prime}\not=i}|\overline{\rho}_{j,r,i^{\prime}}^{(t)}|+4\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\sum_{i^{\prime}\not=i}|\underline{\rho}_{j,r,i^{\prime}}^{(t)}|+\underline{\rho}_{j,r,i}^{(t)}
ρ¯j,r,i(t)+8nlog(4n2/δ)dα,\displaystyle\leq\underline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where the second inequality is by Lemma B.2 and the last inequality is by |\overline{\rho}^{(t)}_{j,r,i^{\prime}}|,|\underline{\rho}_{j,r,i^{\prime}}^{(t)}|\leq\alpha in (C.10). Similarly, for y_{i}=j, we have that \underline{\rho}_{j,r,i}^{(t)}=0 and

𝐰j,r(t)𝐰j,r(0),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle =i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝃i+i=1nρ¯j,r,i(t)𝝃i22𝝃i,𝝃i\displaystyle=\sum_{i^{\prime}=1}^{n}\overline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\underline{\rho}_{j,r,i^{\prime}}^{(t)}\|\bm{\xi}_{i^{\prime}}\|_{2}^{-2}\cdot\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle
ρ¯j,r,i(t)+4log(4n2/δ)dii|ρ¯j,r,i(t)|+4log(4n2/δ)dii|ρ¯j,r,i(t)|\displaystyle\leq\overline{\rho}_{j,r,i}^{(t)}+4\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\sum_{i^{\prime}\not=i}|\overline{\rho}_{j,r,i^{\prime}}^{(t)}|+4\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\sum_{i^{\prime}\not=i}|\underline{\rho}_{j,r,i^{\prime}}^{(t)}|
ρ¯j,r,i(t)+8nlog(4n2/δ)dα,\displaystyle\leq\overline{\rho}^{(t)}_{j,r,i}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where the first inequality is by Lemma B.2 and the second inequality is by |\overline{\rho}^{(t)}_{j,r,i^{\prime}}|,|\underline{\rho}_{j,r,i^{\prime}}^{(t)}|\leq\alpha in (C.10). Similarly, we can show that \langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle\geq\underline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\log(4n^{2}/\delta)/d}\cdot\alpha and \langle\mathbf{w}_{j,r}^{(t)}-\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle\geq\overline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\log(4n^{2}/\delta)/d}\cdot\alpha, which completes the proof. ∎

Lemma C.5.

Under Condition 4.2, suppose (C.10) and (C.11) hold at iteration tt. Then

𝐰j,r(t),yi𝝁\displaystyle\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle 𝐰j,r(0),yi𝝁,\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle,
𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+8nlog(4n2/δ)dα,\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,
Fj(𝐖j(t),𝐱i)\displaystyle F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}_{i}) 1\displaystyle\leq 1

for all r[m]r\in[m] and jyij\not=y_{i}.

Proof of Lemma C.5.

For jyij\not=y_{i}, we have that

𝐰j,r(t),yi𝝁=𝐰j,r(0),yi𝝁+yijγj,r(t)𝐰j,r(0),yi𝝁,\displaystyle\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle=\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle+y_{i}\cdot j\cdot\gamma_{j,r}^{(t)}\leq\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle, (C.12)

where the inequality is by γj,r(t)0\gamma_{j,r}^{(t)}\geq 0. In addition, we have

𝐰j,r(t),𝝃i𝐰j,r(0),𝝃i+ρ¯j,r,i(t)+8nlog(4n2/δ)dα𝐰j,r(0),𝝃i+8nlog(4n2/δ)dα,\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\underline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha, (C.13)

where the first inequality is by Lemma C.4 and the second inequality is due to ρ¯j,r,i(t)0\underline{\rho}_{j,r,i}^{(t)}\leq 0. Then we can get that

Fj(𝐖j(t),𝐱i)\displaystyle F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}_{i}) =1mr=1m[σ(𝐰j,r(t),j𝝁)+σ(𝐰j,r(t),𝝃i)]\displaystyle=\frac{1}{m}\sum_{r=1}^{m}[\sigma(\langle\mathbf{w}_{j,r}^{(t)},-j\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)]
2q+1maxj,r,i{|𝐰j,r(0),𝝁|,|𝐰j,r(0),𝝃i|,8nlog(4n2/δ)dα}q\displaystyle\leq 2^{q+1}\max_{j,r,i}\bigg{\{}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|,8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{\}}^{q}
1,\displaystyle\leq 1,

where the first inequality is by (C.12), (C.13) and the second inequality is by (C.9). ∎

Lemma C.6.

Under Condition 4.2, suppose (C.10) and (C.11) hold at iteration tt. Then

𝐰j,r(t),yi𝝁\displaystyle\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle =𝐰j,r(0),yi𝝁+γj,r(t),\displaystyle=\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle+\gamma_{j,r}^{(t)},
𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+ρ¯j,r,i(t)+8nlog(4n2/δ)dα\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha

for all r[m]r\in[m], j{±1}j\in\{\pm 1\} and i[n]i\in[n]. If max{γj,r(t),ρ¯j,r,i(t)}=O(1)\max\{\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}\}=O(1), we further have that Fj(𝐖j(t),𝐱i)=O(1)F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}_{i})=O(1).

Proof of Lemma C.6.

For j=yij=y_{i}, we have that

𝐰j,r(t),yi𝝁=𝐰j,r(0),yi𝝁+γj,r(t),\displaystyle\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle=\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle+\gamma_{j,r}^{(t)}, (C.14)

where the equation is by Lemma C.3. We also have that

𝐰j,r(t),𝝃i𝐰j,r(0),𝝃i+ρ¯j,r,i(t)+8nlog(4n2/δ)dα,\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha, (C.15)

where the inequality is by Lemma C.4. If \max\{\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)}\}=O(1), we have the following bound:

\displaystyle F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}_{i}) =\frac{1}{m}\sum_{r=1}^{m}[\sigma(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)]
23qmaxj,r,i{γj,r(t),ρ¯j,r,i(t),|𝐰j,r(0),𝝁|,|𝐰j,r(0),𝝃i|,8nlog(4n2/δ)dα}q\displaystyle\leq 2\cdot 3^{q}\max_{j,r,i}\bigg{\{}\gamma_{j,r}^{(t)},\overline{\rho}_{j,r,i}^{(t)},|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|,8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{\}}^{q}
=O(1),\displaystyle=O(1),

where the first inequality is by (C.14), (C.15) and the second inequality is by (C.9) where β=2maxi,j,r{|𝐰j,r(0),𝝁|,|𝐰j,r(0),𝝃i|}\beta=2\max_{i,j,r}\{|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|,|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\}. ∎

Now we are ready to prove Proposition C.2.

Proof of Proposition C.2.

Our proof is based on induction. The results are obvious at t=0t=0 as all the coefficients are zero. Suppose that there exists T~T\widetilde{T}\leq T^{*} such that the results in Proposition C.2 hold for all time 0tT~10\leq t\leq\widetilde{T}-1. We aim to prove that they also hold for t=T~t=\widetilde{T}.

We first prove that (C.11) holds for t=T~t=\widetilde{T}, i.e., ρ¯j,r,i(t)β16nlog(4n2/δ)dα\underline{\rho}^{(t)}_{j,r,i}\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha for t=T~t=\widetilde{T}, r[m]r\in[m], j{±1}j\in\{\pm 1\} and i[n]i\in[n]. Notice that ρ¯j,r,i(t)=0,j=yi\underline{\rho}_{j,r,i}^{(t)}=0,\forall j=y_{i}. Therefore, we only need to consider the case that jyij\not=y_{i}. When ρ¯j,r,i(T~1)0.5β8nlog(4n2/δ)dα\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}\leq-0.5\beta-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha, by Lemma C.4 we have that

𝐰j,r(T~1),𝝃iρ¯j,r,i(T~1)+𝐰j,r(0),𝝃i+8nlog(4n2/δ)dα0,\displaystyle\langle\mathbf{w}_{j,r}^{(\widetilde{T}-1)},\bm{\xi}_{i}\rangle\leq\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}+\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 0,

and thus

ρ¯j,r,i(T~)\displaystyle\underline{\rho}_{j,r,i}^{(\widetilde{T})} =ρ¯j,r,i(T~1)+ηnmi(T~1)σ(𝐰j,r(T~1),𝝃i)𝟙(yi=j)𝝃i22\displaystyle=\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(\widetilde{T}-1)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(\widetilde{T}-1)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=-j)\|\bm{\xi}_{i}\|_{2}^{2}
=ρ¯j,r,i(T~1)\displaystyle=\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}
β16nlog(4n2/δ)dα,\displaystyle\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where the last inequality is by induction hypothesis. When ρ¯j,r,i(T~1)0.5β8nlog(4n2/δ)dα\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}\geq-0.5\beta-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha, we have that

\displaystyle\underline{\rho}_{j,r,i}^{(\widetilde{T})} =\underline{\rho}_{j,r,i}^{(\widetilde{T}-1)}+\frac{\eta}{nm}\cdot\ell_{i}^{\prime(\widetilde{T}-1)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(\widetilde{T}-1)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=-j)\|\bm{\xi}_{i}\|_{2}^{2}
0.5β8nlog(4n2/δ)dαO(ησp2dnm)σ(0.5β+8nlog(4n2/δ)dα)\displaystyle\geq-0.5\beta-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha-O\bigg{(}\frac{\eta\sigma_{p}^{2}d}{nm}\bigg{)}\sigma^{\prime}\bigg{(}0.5\beta+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{)}
0.5β8nlog(4n2/δ)dαO(ηqσp2dnm)(0.5β+8nlog(4n2/δ)dα)\displaystyle\geq-0.5\beta-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha-O\bigg{(}\frac{\eta q\sigma_{p}^{2}d}{nm}\bigg{)}\bigg{(}0.5\beta+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{)}
β16nlog(4n2/δ)dα,\displaystyle\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where we use |\ell_{i}^{\prime(\widetilde{T}-1)}|\leq 1 and \|\bm{\xi}_{i}\|_{2}^{2}=O(\sigma_{p}^{2}d) in the first inequality, the second inequality is by 0.5\beta+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 1, and the last inequality is by \eta=O\big{(}nm/(q\sigma_{p}^{2}d)\big{)} in (C.6).

Next we prove (C.10) holds for t=T~t=\widetilde{T}. We have

|i(t)|\displaystyle|\ell_{i}^{\prime(t)}| =11+exp{yi[F+1(𝐖+1(t),𝐱i)F1(𝐖1(t),𝐱i)]}\displaystyle=\frac{1}{1+\exp\{y_{i}\cdot[F_{+1}(\mathbf{W}_{+1}^{(t)},\mathbf{x}_{i})-F_{-1}(\mathbf{W}_{-1}^{(t)},\mathbf{x}_{i})]\}}
exp{yi[F+1(𝐖+1(t),𝐱i)F1(𝐖1(t),𝐱i)]}\displaystyle\leq\exp\{-y_{i}\cdot[F_{+1}(\mathbf{W}_{+1}^{(t)},\mathbf{x}_{i})-F_{-1}(\mathbf{W}_{-1}^{(t)},\mathbf{x}_{i})]\}
exp{Fyi(𝐖yi(t),𝐱i)+1}.\displaystyle\leq\exp\{-F_{y_{i}}(\mathbf{W}_{y_{i}}^{(t)},\mathbf{x}_{i})+1\}. (C.16)

where the last inequality is due to Lemma C.5. Moreover, recall the update rule of γj,r(t)\gamma_{j,r}^{(t)} and ρ¯j,r,i(t)\overline{\rho}_{j,r,i}^{(t)},

γj,r(t+1)\displaystyle\gamma_{j,r}^{(t+1)} =γj,r(t)ηnmi=1ni(t)σ(𝐰j,r(t),yi𝝁)𝝁22,\displaystyle=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\|\bm{\mu}\|_{2}^{2},
ρ¯j,r,i(t+1)\displaystyle\overline{\rho}_{j,r,i}^{(t+1)} =ρ¯j,r,i(t)ηnmi(t)σ(𝐰j,r(t),𝝃i)𝟙(yi=j)𝝃i22.\displaystyle=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}.

Let t_{j,r,i} be the last time t<T^{*} such that \overline{\rho}_{j,r,i}^{(t)}\leq 0.5\alpha. Then we have that

\displaystyle\overline{\rho}_{j,r,i}^{(\widetilde{T})} =\overline{\rho}_{j,r,i}^{(t_{j,r,i})}-\underbrace{\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t_{j,r,i})}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t_{j,r,i})},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}}_{I_{1}}
\displaystyle\qquad-\underbrace{\sum_{t_{j,r,i}<t<\widetilde{T}}\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}}_{I_{2}}. (C.17)

We first bound I1I_{1} as follows,

|I1|2qn1m1η(ρ¯j,r,i(tj,r,i)+0.5β+8nlog(4n2/δ)dα)q1σp2dq2qn1m1ηαq1σp2d0.25α,\displaystyle|I_{1}|\leq 2qn^{-1}m^{-1}\eta\bigg{(}\overline{\rho}_{j,r,i}^{(t_{j,r,i})}+0.5\beta+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\bigg{)}^{q-1}\sigma_{p}^{2}d\leq q2^{q}n^{-1}m^{-1}\eta\alpha^{q-1}\sigma_{p}^{2}d\leq 0.25\alpha,

where the first inequality is by Lemmas C.4 and B.2, the second inequality is by β0.1α\beta\leq 0.1\alpha and 8nlog(4n2/δ)dα0.1α8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 0.1\alpha, the last inequality is by ηnm/(q2q+2αq2σp2d)\eta\leq nm/(q2^{q+2}\alpha^{q-2}\sigma_{p}^{2}d).

Second, we bound I2I_{2}. For tj,r,i<t<T~t_{j,r,i}<t<\widetilde{T} and yi=jy_{i}=j, we can lower bound 𝐰j,r(t),𝝃i\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle as follows,

𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+ρ¯j,r,i(t)8nlog(4n2/δ)dα\displaystyle\geq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
0.5β+0.5α8nlog(4n2/δ)dα\displaystyle\geq-0.5\beta+0.5\alpha-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
0.25α,\displaystyle\geq 0.25\alpha,

where the first inequality is by Lemma C.4, the second inequality is by ρ¯j,r,i(t)>0.5α\overline{\rho}_{j,r,i}^{(t)}>0.5\alpha and 𝐰j,r(0),𝝃i0.5β\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle\geq-0.5\beta due to the definition of tj,r,it_{j,r,i} and β\beta, the last inequality is by β0.1α\beta\leq 0.1\alpha and 8nlog(4n2/δ)dα0.1α8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 0.1\alpha. Similarly, for tj,r,i<t<T~t_{j,r,i}<t<\widetilde{T} and yi=jy_{i}=j, we can also upper bound 𝐰j,r(t),𝝃i\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle as follows,

𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+ρ¯j,r,i(t)+8nlog(4n2/δ)dα\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
0.5β+α+8nlog(4n2/δ)dα\displaystyle\leq 0.5\beta+\alpha+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
2α,\displaystyle\leq 2\alpha,

where the first inequality is by Lemma C.4, the second inequality is by induction hypothesis ρ¯j,r,i(t)α\overline{\rho}_{j,r,i}^{(t)}\leq\alpha, the last inequality is by β0.1α\beta\leq 0.1\alpha and 8nlog(4n2/δ)dα0.1α8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 0.1\alpha. Thus, plugging the upper and lower bounds of 𝐰j,r(t),𝝃i\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle into I2I_{2} gives

|I2|\displaystyle|I_{2}| tj,r,i<t<T~ηnmexp(σ(𝐰j,r(t),𝝃i)+1)σ(𝐰j,r(t),𝝃i)𝟙(yi=j)𝝃i22\displaystyle\leq\sum_{t_{j,r,i}<t<\widetilde{T}}\frac{\eta}{nm}\cdot\exp(-\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)+1)\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}
eq2qηTnexp(αq/4q)αq1σp2d\displaystyle\leq\frac{eq2^{q}\eta T^{*}}{n}\exp(-\alpha^{q}/4^{q})\alpha^{q-1}\sigma_{p}^{2}d
0.25Texp(αq/4q)α\displaystyle\leq 0.25T^{*}\exp(-\alpha^{q}/4^{q})\alpha
0.25Texp(log(T)q)α\displaystyle\leq 0.25T^{*}\exp(-\log(T^{*})^{q})\alpha
0.25α,\displaystyle\leq 0.25\alpha,

where the first inequality is by (C.16), the second inequality is by Lemma B.2, the third inequality is by η=O(nm/(q2q+2αq2σp2d))\eta=O\big{(}nm/(q2^{q+2}\alpha^{q-2}\sigma_{p}^{2}d)\big{)} in (C.6), the fourth inequality is by our choice of α=4log(T)\alpha=4\log(T^{*}) and the last inequality is due to the fact that log(T)qlog(T)\log(T^{*})^{q}\geq\log(T^{*}). Plugging the bound of I1,I2I_{1},I_{2} into (C.17) completes the proof for ρ¯\overline{\rho}. Similarly, we can prove that γj,r(T~)α\gamma_{j,r}^{(\widetilde{T})}\leq\alpha using η=O(nm/(q2q+2αq2𝝁22))\eta=O\big{(}nm/(q2^{q+2}\alpha^{q-2}\|\bm{\mu}\|_{2}^{2})\big{)} in (C.6). Therefore Proposition C.2 holds for t=T~t=\widetilde{T}, which completes the induction. ∎

Based on Proposition C.2, we introduce some important properties of the training loss function for 0tT0\leq t\leq T^{*}.

Lemma C.7 (Restatement of Lemma 5.4).

Under Condition 4.2, for 0tT0\leq t\leq T^{*}, the following result holds.

LS(𝐖(t))F2O(max{𝝁22,σp2d})LS(𝐖(t)).\displaystyle\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}\leq O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\})L_{S}(\mathbf{W}^{(t)}).
Proof of Lemma C.7.

We first prove that

\displaystyle-\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\cdot\|\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i})\|_{F}^{2}=O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\}). (C.18)

Without loss of generality, we suppose that yi=1y_{i}=1 and 𝐱i=[𝝁,𝝃i]\mathbf{x}_{i}=[\bm{\mu}^{\top},\bm{\xi}_{i}]. Then we have that

f(𝐖(t),𝐱i)F\displaystyle\|\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i})\|_{F} 1mj,r[σ(𝐰j,r(t),𝝁)𝝁+σ(𝐰j,r(t),𝝃i)𝝃i]2\displaystyle\leq\frac{1}{m}\sum_{j,r}\bigg{\|}\big{[}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle)\bm{\mu}+\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\bm{\xi}_{i}\big{]}\bigg{\|}_{2}
1mj,rσ(𝐰j,r(t),𝝁)𝝁2+1mj,rσ(𝐰j,r(t),𝝃i)𝝃i2\displaystyle\leq\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle)\|\bm{\mu}\|_{2}+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\|\bm{\xi}_{i}\|_{2}
2q[F+1(𝐖+1(t),𝐱i)](q1)/qmax{𝝁2,2σpd}\displaystyle\leq 2q\bigg{[}F_{+1}(\mathbf{W}^{(t)}_{+1},\mathbf{x}_{i})\bigg{]}^{(q-1)/q}\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}
+2q[F1(𝐖1(t),𝐱i)](q1)/qmax{𝝁2,2σpd}\displaystyle\qquad+2q\bigg{[}F_{-1}(\mathbf{W}^{(t)}_{-1},\mathbf{x}_{i})\bigg{]}^{(q-1)/q}\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}
2q[F+1(𝐖+1(t),𝐱i)](q1)/qmax{𝝁2,2σpd}+2qmax{𝝁2,2σpd},\displaystyle\leq 2q\bigg{[}F_{+1}(\mathbf{W}^{(t)}_{+1},\mathbf{x}_{i})\bigg{]}^{(q-1)/q}\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}+2q\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\},

where the first and second inequalities are by the triangle inequality, the third inequality is by Jensen’s inequality and Lemma B.2, and the last inequality is due to Lemma C.5. Denote A=F_{+1}(\mathbf{W}_{+1}^{(t)},\mathbf{x}_{i}). Then A\geq 0, and moreover F_{-1}(\mathbf{W}^{(t)}_{-1},\mathbf{x}_{i})\leq 1 by Lemma C.5. Therefore, we have that

(yif(𝐖(t),𝐱i))f(𝐖(t),𝐱i)F2\displaystyle-\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\cdot\|\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i})\|_{F}^{2}
(A1)(2qA(q1)/qmax{𝝁2,2σpd}+2qmax{𝝁2,2σpd})2\displaystyle\qquad\leq-\ell^{\prime}(A-1)\bigg{(}2q\cdot A^{(q-1)/q}\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}+2q\cdot\max\{\|\bm{\mu}\|_{2},2\sigma_{p}\sqrt{d}\}\bigg{)}^{2}
=4q2(A1)(A(q1)/q+1)2max{𝝁22,4σp2d}\displaystyle\qquad=-4q^{2}\ell^{\prime}(A-1)(A^{(q-1)/q}+1)^{2}\cdot\max\{\|\bm{\mu}\|_{2}^{2},4\sigma_{p}^{2}d\}
(maxz>04q2(z1)(z(q1)/q+1)2)max{𝝁22,4σp2d}\displaystyle\qquad\leq\big{(}\max_{z>0}-4q^{2}\ell^{\prime}(z-1)(z^{(q-1)/q}+1)^{2}\big{)}\max\{\|\bm{\mu}\|_{2}^{2},4\sigma_{p}^{2}d\}
=(i)O(max{𝝁22,σp2d}),\displaystyle\qquad\overset{(i)}{=}O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\}),

where (i) is by maxz04q2(z1)(z(q1)/q+1)2<\max_{z\geq 0}-4q^{2}\ell^{\prime}(z-1)(z^{(q-1)/q}+1)^{2}<\infty because \ell^{\prime} has an exponentially decaying tail. Now we can upper bound the gradient norm LS(𝐖(t))F\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F} as follows,

LS(𝐖(t))F2\displaystyle\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2} [1ni=1n(yif(𝐖(t),𝐱i))f(𝐖(t),𝐱i)F]2\displaystyle\leq\bigg{[}\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\|\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i})\|_{F}\bigg{]}^{2}
[1ni=1nO(max{𝝁22,σp2d})(yif(𝐖(t),𝐱i))]2\displaystyle\leq\bigg{[}\frac{1}{n}\sum_{i=1}^{n}\sqrt{-O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\})\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}}\bigg{]}^{2}
O(max{𝝁22,σp2d})1ni=1n(yif(𝐖(t),𝐱i))\displaystyle\leq O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\})\cdot\frac{1}{n}\sum_{i=1}^{n}-\ell^{\prime}\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}
O(max{𝝁22,σp2d})LS(𝐖(t)),\displaystyle\leq O(\max\{\|\bm{\mu}\|_{2}^{2},\sigma_{p}^{2}d\})L_{S}(\mathbf{W}^{(t)}),

where the first inequality is by the triangle inequality, the second inequality is by (C.18), the third inequality is by the Cauchy-Schwarz inequality, and the last inequality is due to the property of the cross-entropy loss that -\ell^{\prime}\leq\ell. ∎
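Two analytic facts about the cross-entropy loss are used above: -\ell^{\prime}(z)\leq\ell(z) for all z, and the boundedness of the map in step (i), which relies on the exponentially decaying tail of \ell^{\prime}. The following short sketch (with an illustrative exponent q) checks both numerically.

```python
import numpy as np

z = np.linspace(-20.0, 80.0, 200001)
ell  = np.log1p(np.exp(-z))              # cross-entropy loss ell(z) = log(1 + e^{-z})
dell = -1.0 / (1.0 + np.exp(z))          # ell'(z), always negative

# Property used in the last step of Lemma C.7: -ell'(z) <= ell(z) for all z.
assert np.all(-dell <= ell + 1e-12)

# Step (i): z |-> -4 q^2 ell'(z-1) (z^{(q-1)/q} + 1)^2 is bounded over z > 0,
# since the exponential tail of ell' dominates the polynomial factor.
q = 3                                     # illustrative exponent, q > 2
zp = np.linspace(1e-6, 80.0, 200001)
g = -4 * q ** 2 * (-1.0 / (1.0 + np.exp(zp - 1))) * (zp ** ((q - 1) / q) + 1) ** 2
print(g.max())                            # a finite O(1) constant
```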

Appendix D Signal Learning

In this section, we consider the signal learning case under the condition that n𝝁2qΩ~(σpq(d)q)n\|\bm{\mu}\|_{2}^{q}\geq\widetilde{\Omega}(\sigma_{p}^{q}(\sqrt{d})^{q}). We remind the readers that the proofs in this section are based on the results in Section B, which hold with high probability.

D.1 First stage

Lemma D.1 (Restatement of Lemma 5.5).

Under the same conditions as Theorem 4.3, in particular if we choose

nSNRqClog(6/σ0𝝁2)22q+6[4log(8mn/δ)](q1)/2,\displaystyle n\cdot\mathrm{SNR}^{q}\geq C\log(6/\sigma_{0}\|\bm{\mu}\|_{2})2^{2q+6}[4\log(8mn/\delta)]^{(q-1)/2}, (D.1)

where C=O(1)C=O(1) is a positive constant, there exists time

T1=Clog(6/σ0𝝁2)2q+1mησ0q2𝝁2q\displaystyle T_{1}=\frac{C\log(6/\sigma_{0}\|\bm{\mu}\|_{2})2^{q+1}m}{\eta\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}

such that

  • maxrγj,r(T1)2\max_{r}\gamma_{j,r}^{(T_{1})}\geq 2 for j{±1}j\in\{\pm 1\}.

  • |ρj,r,i(t)|σ0σpd/2|\rho_{j,r,i}^{(t)}|\leq\sigma_{0}\sigma_{p}\sqrt{d}/2 for all j{±1},r[m]j\in\{\pm 1\},r\in[m], i[n]i\in[n] and 0tT10\leq t\leq T_{1}.

Proof of Lemma D.1.

Let

T1+=nmη1σ02qσpqdq/22q+4q[4log(8mn/δ)](q1)/2.\displaystyle T_{1}^{+}=\frac{nm\eta^{-1}\sigma_{0}^{2-q}\sigma_{p}^{-q}d^{-q/2}}{2^{q+4}q[4\log(8mn/\delta)]^{(q-1)/2}}. (D.2)

We first prove the second bullet. Define Ψ(t)=maxj,r,i|ρj,r,i(t)|=maxj,r,i{ρ¯j,r,i(t),ρ¯j,r,i(t)}\Psi^{(t)}=\max_{j,r,i}|\rho_{j,r,i}^{(t)}|=\max_{j,r,i}\{\overline{\rho}_{j,r,i}^{(t)},-\underline{\rho}_{j,r,i}^{(t)}\}. We use induction to show that

Ψ(t)σ0σpd/2\displaystyle\Psi^{(t)}\leq\sigma_{0}\sigma_{p}\sqrt{d}/2 (D.3)

for all 0tT1+0\leq t\leq T_{1}^{+}. By definition, clearly we have Ψ(0)=0\Psi^{(0)}=0. Now suppose that there exists some T~T1+\widetilde{T}\leq T_{1}^{+} such that (D.3) holds for 0<tT~10<t\leq\widetilde{T}-1. Then by (C.4) and (C.5) we have

Ψ(t+1)\displaystyle\Psi^{(t+1)} Ψ(t)+maxj,r,i{ηnm|i(t)|σ(𝐰j,r(0),𝝃i+i=1nΨ(t)|𝝃i,𝝃i|𝝃i22+i=1nΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle\leq\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot|\ell_{i}^{\prime(t)}|\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\sum_{i^{\prime}=1}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}+\sum_{i^{\prime}=1}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2}\bigg{\}}
Ψ(t)+maxj,r,i{ηnmσ(𝐰j,r(0),𝝃i+2i=1nΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle\leq\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+2\cdot\sum_{i^{\prime}=1}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2}\bigg{\}}
=Ψ(t)+maxj,r,i{ηnmσ(𝐰j,r(0),𝝃i+2Ψ(t)+2iinΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle=\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+2\Psi^{(t)}+2\cdot\sum_{i^{\prime}\neq i}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i}\|_{2}^{2}\bigg{\}}
Ψ(t)+ηqnm[2log(8mn/δ)σ0σpd+(2+4nσp2dlog(4n2/δ)σp2d/2)Ψ(t)]q12σp2d\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\Bigg{[}2\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}+\Bigg{(}2+\frac{4n\sigma_{p}^{2}\cdot\sqrt{d\log(4n^{2}/\delta)}}{\sigma_{p}^{2}d/2}\Bigg{)}\cdot\Psi^{(t)}\Bigg{]}^{q-1}\cdot 2\sigma_{p}^{2}d
Ψ(t)+ηqnm(2log(8mn/δ)σ0σpd+4Ψ(t))q12σp2d\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\big{(}2\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}+4\Psi^{(t)}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d
Ψ(t)+ηqnm(4log(8mn/δ)σ0σpd)q12σp2d,\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\big{(}4\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d,

where the second inequality is by |i(t)|1|\ell_{i}^{\prime(t)}|\leq 1, the third inequality is due to Lemmas B.2 and B.3, the fourth inequality follows by the condition that d16n2log(4n2/δ)d\geq 16n^{2}\log(4n^{2}/\delta), and the last inequality follows by the induction hypothesis (D.3). Taking a telescoping sum over t=0,1,,T~1t=0,1,\ldots,\widetilde{T}-1 then gives

Ψ(T~)\displaystyle\Psi^{(\widetilde{T})} T~ηqnm(4log(8mn/δ)σ0σpd)q12σp2d\displaystyle\leq\widetilde{T}\cdot\frac{\eta q}{nm}\cdot\big{(}4\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d
T1+ηqnm(4log(8mn/δ)σ0σpd)q12σp2d\displaystyle\leq T_{1}^{+}\cdot\frac{\eta q}{nm}\cdot\big{(}4\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d
σ0σpd2,\displaystyle\leq\frac{\sigma_{0}\sigma_{p}\sqrt{d}}{2},

where the second inequality follows by T~T1+\widetilde{T}\leq T_{1}^{+} in our induction hypothesis. Therefore, by induction, we have Ψ(t)σ0σpd/2\Psi^{(t)}\leq\sigma_{0}\sigma_{p}\sqrt{d}/2 for all tT1+t\leq T_{1}^{+}.

Now, without loss of generality, let us consider j=1 first. Denote by T_{1,1} the last time t in the period [0,T_{1}^{+}] such that \max_{r}\gamma_{1,r}^{(t)}\leq 2. Then for t\leq T_{1,1}, we have \max_{j,r,i}\{|\rho_{j,r,i}^{(t)}|\}=O(\sigma_{0}\sigma_{p}\sqrt{d})=O(1) and \max_{r}\gamma_{1,r}^{(t)}\leq 2. Therefore, by Lemmas C.5 and C.6, we know that F_{-1}(\mathbf{W}_{-1}^{(t)},\mathbf{x}_{i}),F_{+1}(\mathbf{W}_{+1}^{(t)},\mathbf{x}_{i})=O(1) for all i with y_{i}=1. Thus there exists a positive constant C_{1} such that -\ell^{\prime(t)}_{i}\geq C_{1} for all i with y_{i}=1.

By (C.3), for tT1,1t\leq T_{1,1} we have

γ1,r(t+1)\displaystyle\gamma_{1,r}^{(t+1)} =γ1,r(t)ηnmi=1ni(t)σ(yi𝐰1,r(0),𝝁+yiγ1,r(t))𝝁22\displaystyle=\gamma_{1,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(y_{i}\cdot\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle+y_{i}\cdot\gamma_{1,r}^{(t)})\cdot\|\bm{\mu}\|_{2}^{2}
γ1,r(t)+C1ηnmyi=1σ(𝐰1,r(0),𝝁+γ1,r(t))𝝁22.\displaystyle\geq\gamma_{1,r}^{(t)}+\frac{C_{1}\eta}{nm}\cdot\sum_{y_{i}=1}\sigma^{\prime}(\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle+\gamma_{1,r}^{(t)})\cdot\|\bm{\mu}\|_{2}^{2}.

Denote γ^1,r(t)=γ1,r(t)+𝐰1,r(0),𝝁\widehat{\gamma}_{1,r}^{(t)}=\gamma_{1,r}^{(t)}+\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle and let A(t)=maxrγ^1,r(t)A^{(t)}=\max_{r}\widehat{\gamma}_{1,r}^{(t)}. Then we have

A(t+1)\displaystyle A^{(t+1)} A(t)+C1ηnmyi=1σ(A(t))𝝁22\displaystyle\geq A^{(t)}+\frac{C_{1}\eta}{nm}\cdot\sum_{y_{i}=1}\sigma^{\prime}(A^{(t)})\cdot\|\bm{\mu}\|_{2}^{2}
A(t)+C1ηq𝝁224m[A(t)]q1\displaystyle\geq A^{(t)}+\frac{C_{1}\eta q\|\bm{\mu}\|_{2}^{2}}{4m}\big{[}A^{(t)}\big{]}^{q-1}
(1+C1ηq𝝁224m[A(0)]q2)A(t)\displaystyle\geq\bigg{(}1+\frac{C_{1}\eta q\|\bm{\mu}\|_{2}^{2}}{4m}\big{[}A^{(0)}\big{]}^{q-2}\bigg{)}A^{(t)}
(1+C1ηqσ0q2𝝁2q2qm)A(t),\displaystyle\geq\bigg{(}1+\frac{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}{2^{q}m}\bigg{)}A^{(t)},

where the second inequality is by the lower bound on the number of positive data in Lemma B.1, the third inequality is due to the fact that A(t)A^{(t)} is an increasing sequence, and the last inequality follows by A(0)=maxr𝐰1,r(0),𝝁σ0𝝁2/2A^{(0)}=\max_{r}\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle\geq\sigma_{0}\|\bm{\mu}\|_{2}/2 proved in Lemma B.3. Therefore, the sequence A(t)A^{(t)} will exponentially grow and we have that

A(t)\displaystyle A^{(t)} (1+C1ηqσ0q2𝝁2q2qm)tA(0)exp(C1ηqσ0q2𝝁2q2q+1mt)A(0)exp(C1ηqσ0q2𝝁2q2q+1mt)σ0𝝁22,\displaystyle\geq\bigg{(}1+\frac{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}{2^{q}m}\bigg{)}^{t}A^{(0)}\geq\exp\bigg{(}\frac{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}{2^{q+1}m}t\bigg{)}A^{(0)}\geq\exp\bigg{(}\frac{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}{2^{q+1}m}t\bigg{)}\frac{\sigma_{0}\|\bm{\mu}\|_{2}}{2},

where the second inequality is due to the fact that 1+zexp(z/2)1+z\geq\exp(z/2) for z2z\leq 2 and our condition of η\eta and σ0\sigma_{0} listed in Condition 4.2, and the last inequality follows by Lemma B.3 and A(0)=maxr𝐰1,r(0),𝝁A^{(0)}=\max_{r}\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle. Therefore, A(t)A^{(t)} will reach 33 within T1=log(6/σ0𝝁2)2q+1mC1ηqσ0q2𝝁2qT_{1}=\frac{\log(6/\sigma_{0}\|\bm{\mu}\|_{2})2^{q+1}m}{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}} iterations. Since maxrγ1,r(t)A(t)maxr|𝐰1,r(0),𝝁|A(t)1\max_{r}\gamma_{1,r}^{(t)}\geq A^{(t)}-\max_{r}|\langle\mathbf{w}_{1,r}^{(0)},\bm{\mu}\rangle|\geq A^{(t)}-1, maxrγ1,r(t)\max_{r}\gamma_{1,r}^{(t)} will reach 22 within T1T_{1} iterations. We can next verify that

T1=log(6/σ0𝝁2)2q+1mC1ηqσ0q2𝝁2qnmη1σ02qσpqdq/22q+5q[4log(8mn/δ)](q1)/2=T1+/2,\displaystyle T_{1}=\frac{\log(6/\sigma_{0}\|\bm{\mu}\|_{2})2^{q+1}m}{C_{1}\eta q\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}\leq\frac{nm\eta^{-1}\sigma_{0}^{2-q}\sigma_{p}^{-q}d^{-q/2}}{2^{q+5}q[4\log(8mn/\delta)]^{(q-1)/2}}=T_{1}^{+}/2,

where the inequality holds due to our SNR condition in (D.1). Therefore, by the definition of T1,1T_{1,1}, we have T1,1T1T1+/2T_{1,1}\leq T_{1}\leq T_{1}^{+}/2, where we use the non-decreasing property of γ\gamma. The proof for j=1j=-1 is similar, and we can prove that maxrγ1,r(T1,1)2\max_{r}\gamma_{-1,r}^{(T_{1,-1})}\geq 2 while T1,1T1T1+/2T_{1,-1}\leq T_{1}\leq T_{1}^{+}/2, which completes the proof. ∎
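The mechanism driving this first stage is the scalar recursion A^{(t+1)}\geq A^{(t)}+\frac{C_{1}\eta q\|\bm{\mu}\|_{2}^{2}}{4m}\big{[}A^{(t)}\big{]}^{q-1} established above. The sketch below (the constants C_{1}, \eta, \sigma_{0}, \|\bm{\mu}\|_{2}, q, m are illustrative assumptions) iterates this recursion as an equality, i.e., the slowest growth allowed by the bound, starting from A^{(0)}=\sigma_{0}\|\bm{\mu}\|_{2}/2, and confirms that the constant threshold is crossed well within T_{1} iterations.

```python
import numpy as np

# Illustrative constants (C1, eta, sigma0, ||mu||_2, q, m are assumptions for this sketch).
C1, eta, sigma0, mu_norm, q, m = 0.5, 0.01, 1e-3, 2.0, 3, 10

A = sigma0 * mu_norm / 2                      # A^{(0)} >= sigma_0 ||mu||_2 / 2
T1 = (np.log(6 / (sigma0 * mu_norm)) * 2 ** (q + 1) * m
      / (C1 * eta * q * sigma0 ** (q - 2) * mu_norm ** q))

t = 0
while A < 3:
    A = A + C1 * eta * q * mu_norm ** 2 / (4 * m) * A ** (q - 1)   # one step of the recursion
    t += 1

print(t, int(T1))     # the threshold is reached within (well under) T1 iterations
assert t <= T1
```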

D.2 Second Stage

By the results obtained in the first stage, we know that

𝐰j,r(T1)\displaystyle\mathbf{w}_{j,r}^{(T_{1})} =𝐰j,r(0)+jγj,r(T1)𝝁𝝁22+i=1nρ¯j,r,i(T1)𝝃i𝝃i22+i=1nρ¯j,r,i(T1)𝝃i𝝃i22.\displaystyle=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(T_{1})}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}.

At the beginning of the second stage, the following properties hold:

  • maxrγj,r(T1)2,j{±1}\max_{r}\gamma_{j,r}^{(T_{1})}\geq 2,\forall j\in\{\pm 1\}.

  • maxj,r,i|ρj,r,i(T1)|β^\max_{j,r,i}|\rho_{j,r,i}^{(T_{1})}|\leq\widehat{\beta} where β^=σ0σpd/2\widehat{\beta}=\sigma_{0}\sigma_{p}\sqrt{d}/2.

Lemma 5.1 implies that the learned signal coefficient \gamma_{j,r}^{(t)} is non-decreasing, i.e., for t\geq T_{1}, we have that \gamma_{j,r}^{(t+1)}\geq\gamma_{j,r}^{(t)}, and therefore \max_{r}\gamma_{j,r}^{(t)}\geq 2. Now we choose \mathbf{W}^{*} as follows:

𝐰j,r=𝐰j,r(0)+2qmlog(2q/ϵ)j𝝁𝝁22.\displaystyle\mathbf{w}^{*}_{j,r}=\mathbf{w}_{j,r}^{(0)}+2qm\log(2q/\epsilon)\cdot j\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}.

Based on the above definition of 𝐖\mathbf{W}^{*}, we have the following lemma.

Lemma D.2.

Under the same conditions as Theorem 4.4, we have that 𝐖(T1)𝐖FO~(m3/2𝛍21)\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}\leq\widetilde{O}(m^{3/2}\|\bm{\mu}\|_{2}^{-1}).

Proof of Lemma D.2.

We have

𝐖(T1)𝐖F\displaystyle\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F} 𝐖(T1)𝐖(0)F+𝐖(0)𝐖F\displaystyle\leq\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{(0)}\|_{F}+\|\mathbf{W}^{(0)}-\mathbf{W}^{*}\|_{F}
j,rγj,r(T1)𝝁2+j,r,i|ρ¯j,r,i(T1)|𝝃i2+j,r,i|ρ¯j,r,i(T1)|𝝃i2+O(m3/2log(1/ϵ))𝝁21\displaystyle\leq\sum_{j,r}\frac{\gamma_{j,r}^{(T_{1})}}{\|\bm{\mu}\|_{2}}+\sum_{j,r,i}\frac{|\overline{\rho}_{j,r,i}^{(T_{1})}|}{\|\bm{\xi}_{i}\|_{2}}+\sum_{j,r,i}\frac{|\underline{\rho}_{j,r,i}^{(T_{1})}|}{\|\bm{\xi}_{i}\|_{2}}+O(m^{3/2}\log(1/\epsilon))\|\bm{\mu}\|_{2}^{-1}
O~(m𝝁1)+O(nmσ0)+O(m3/2log(1/ϵ))𝝁21\displaystyle\leq\widetilde{O}(m\|\bm{\mu}\|^{-1})+O(nm\sigma_{0})+O(m^{3/2}\log(1/\epsilon))\|\bm{\mu}\|_{2}^{-1}
O~(m3/2𝝁21),\displaystyle\leq\widetilde{O}(m^{3/2}\|\bm{\mu}\|_{2}^{-1}),

where the first inequality is by triangle inequality, the second inequality is by our decomposition of 𝐖(T1)\mathbf{W}^{(T_{1})} and the definition of 𝐖\mathbf{W}^{*}, the third inequality is by Proposition C.2 and Lemma D.1, and the last inequality is by our condition of σ0\sigma_{0} in Condition 4.2. ∎

Lemma D.3.

Under the same conditions as Theorem 4.3, we have that yif(𝐖(t),𝐱i),𝐖qlog(2q/ϵ)y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle\geq q\log(2q/\epsilon) for all i[n]i\in[n] and T1tTT_{1}\leq t\leq T^{*}.

Proof of Lemma D.3.

Recall that f(𝐖(t),𝐱i)=(1/m)j,rj[σ(𝐰j,r,yi𝝁)+σ(𝐰j,r,𝝃i)]f(\mathbf{W}^{(t)},\mathbf{x}_{i})=(1/m){\sum_{j,r}}j\cdot\big{[}\sigma(\langle\mathbf{w}_{j,r},y_{i}\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}_{i}\rangle)\big{]}, so we have

yif(𝐖(t),𝐱i),𝐖\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle =1mj,rσ(𝐰j,r(t),yi𝝁)𝝁,j𝐰j,r+1mj,rσ(𝐰j,r(t),𝝃i)yi𝝃i,j𝐰j,r\displaystyle=\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\langle\bm{\mu},j\mathbf{w}_{j,r}^{*}\rangle+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\langle y_{i}\bm{\xi}_{i},j\mathbf{w}_{j,r}^{*}\rangle
=1mj,rσ(𝐰j,r(t),yi𝝁)2qmlog(2q/ϵ)+1mj,rσ(𝐰j,r(t),yi𝝁)𝝁,j𝐰j,r(0)\displaystyle=\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)2qm\log(2q/\epsilon)+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\langle\bm{\mu},j\mathbf{w}_{j,r}^{(0)}\rangle
+1mj,rσ(𝐰j,r(t),𝝃i)yi𝝃i,j𝐰j,r(0)\displaystyle\qquad+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\langle y_{i}\bm{\xi}_{i},j\mathbf{w}_{j,r}^{(0)}\rangle
1mj,rσ(𝐰j,r(t),yi𝝁)2qmlog(2q/ϵ)1mj,rσ(𝐰j,r(t),yi𝝁)O~(σ0𝝁2)\displaystyle\geq\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)2qm\log(2q/\epsilon)-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})
1mj,rσ(𝐰j,r(t),𝝃i)O~(σ0σpd),\displaystyle\qquad-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}), (D.4)

where the inequality is by Lemma B.3. Next we will bound the inner-product terms in (D.4) respectively. By Lemma C.6, we have that for j=yij=y_{i}

maxr{𝐰j,r(t),yi𝝁}=maxr{γj,r(t)+𝐰j,r(0),yi𝝁}2O~(σ0𝝁2)1.\displaystyle\max_{r}\{\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle\}=\max_{r}\{\gamma_{j,r}^{(t)}+\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle\}\geq 2-\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})\geq 1. (D.5)

We can also get the upper bound of the inner products between the parameter and the signal (noise) as follows,

|𝐰j,r(t),𝝁|(i)|𝐰j,r(0),𝝁|+|γj,r(t)|(ii)O~(1)\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle|\overset{(i)}{\leq}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|+|\gamma_{j,r}^{(t)}|\overset{(ii)}{\leq}\widetilde{O}(1)
|𝐰j,r(t),𝝃i|(iii)|𝐰j,r(0),𝝃i|+|ρ¯j,r,i(t)|+|ρ¯j,r,i(t)|+8nlog(4n2/δ)dα(iv)O~(1),\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle|\overset{(iii)}{\leq}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|+|\underline{\rho}_{j,r,i}^{(t)}|+|\overline{\rho}_{j,r,i}^{(t)}|+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\overset{(iv)}{\leq}\widetilde{O}(1), (D.6)

where (i) is by Lemma C.3, (iii) is by Lemma C.4, (ii) and (iv) are due to Proposition C.2. Plugging (D.5) and (D.6) into (D.4) gives,

yif(𝐖(t),𝐱i),𝐖\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle 2qlog(2q/ϵ)O~(σ0𝝁2)O~(σ0σpd)qlog(2q/ϵ),\displaystyle\geq 2q\log(2q/\epsilon)-\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})-\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d})\geq q\log(2q/\epsilon),

where the last inequality is by σ0O~(m2/(q2)n1)min{(σpd)1,𝝁21}\sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-1})\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\} in Condition 4.2. This completes the proof. ∎

Lemma D.4.

Under the same conditions as Theorem 4.3, we have that

𝐖(t)𝐖F2𝐖(t+1)𝐖F2(2q1)ηLS(𝐖(t))ηϵ\displaystyle\|\mathbf{W}^{(t)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(t+1)}-\mathbf{W}^{*}\|_{F}^{2}\geq(2q-1)\eta L_{S}(\mathbf{W}^{(t)})-\eta\epsilon

for all T1tTT_{1}\leq t\leq T^{*}.

Proof of Lemma D.4.

We first apply a proof technique similar to Lemma 2.6 in Ji and Telgarsky (2020). The difference between our analysis and Ji and Telgarsky (2020) is that here the neural network is q-homogeneous rather than 1-homogeneous.

𝐖(t)𝐖F2𝐖(t+1)𝐖F2\displaystyle\|\mathbf{W}^{(t)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(t+1)}-\mathbf{W}^{*}\|_{F}^{2}
=2ηLS(𝐖(t)),𝐖(t)𝐖η2LS(𝐖(t))F2\displaystyle=2\eta\langle\nabla L_{S}(\mathbf{W}^{(t)}),\mathbf{W}^{(t)}-\mathbf{W}^{*}\rangle-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
=2ηni=1ni(t)[qyif(𝐖(t),𝐱i)f(𝐖(t),𝐱i),𝐖]η2LS(𝐖(t))F2\displaystyle=\frac{2\eta}{n}\sum_{i=1}^{n}\ell^{\prime(t)}_{i}[qy_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})-\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
2ηni=1ni(t)[qyif(𝐖(t),𝐱i)qlog(2q/ϵ)]η2LS(𝐖(t))F2\displaystyle\geq\frac{2\eta}{n}\sum_{i=1}^{n}\ell^{\prime(t)}_{i}[qy_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})-q\log(2q/\epsilon)]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
2qηni=1n[(yif(𝐖(t),𝐱i))ϵ/(2q)]η2LS(𝐖(t))F2\displaystyle\geq\frac{2q\eta}{n}\sum_{i=1}^{n}[\ell\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}-\epsilon/(2q)]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
(2q1)ηLS(𝐖(t))ηϵ,\displaystyle\geq(2q-1)\eta L_{S}(\mathbf{W}^{(t)})-\eta\epsilon,

where the first inequality is by Lemma D.3, the second inequality is due to the convexity of the cross entropy function, and the last inequality is due to Lemma C.7. ∎
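Two elementary facts drive the chain of (in)equalities above: Euler's identity for the q-homogeneous network, \langle\nabla f(\mathbf{W},\mathbf{x}),\mathbf{W}\rangle=qf(\mathbf{W},\mathbf{x}), which gives the second equality, and the convexity of the cross-entropy loss, \ell^{\prime}(z)(z-a)\geq\ell(z)-\ell(a) with a=\log(2q/\epsilon) and \ell(a)\leq\exp(-a)=\epsilon/(2q), which gives the second inequality. The sketch below (toy sizes and an illustrative \epsilon are assumptions) checks both numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

# (a) q-homogeneity: <grad_W f(W, x), W> = q f(W, x) for the CNN in (3.1),
#     checked on random weights and data (sizes are illustrative assumptions).
m, d, q = 4, 20, 3
mu, xi, y = rng.standard_normal(d), rng.standard_normal(d), 1
W = {j: 0.1 * rng.standard_normal((m, d)) for j in (+1, -1)}

act  = lambda z: np.maximum(z, 0) ** q
dact = lambda z: q * np.maximum(z, 0) ** (q - 1)
f = sum(j * (act(W[j] @ (y * mu)).sum() + act(W[j] @ xi).sum()) for j in (+1, -1)) / m
grad = {j: (j / m) * (np.outer(dact(W[j] @ (y * mu)), y * mu) + np.outer(dact(W[j] @ xi), xi))
        for j in (+1, -1)}
inner = sum((grad[j] * W[j]).sum() for j in (+1, -1))
assert np.isclose(inner, q * f)           # Euler's identity for the q-homogeneous network

# (b) convexity of the cross-entropy loss: ell'(z)(z - a) >= ell(z) - ell(a),
#     which with a = log(2q/eps) and ell(a) <= exp(-a) = eps/(2q) gives the second inequality.
ell  = lambda z: np.log1p(np.exp(-z))
dell = lambda z: -1.0 / (1.0 + np.exp(z))
eps = 0.01
a = np.log(2 * q / eps)
z = np.linspace(-30, 30, 10001)
assert np.all(dell(z) * (z - a) >= ell(z) - ell(a) - 1e-9)
assert ell(a) <= np.exp(-a)
```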

Lemma D.5 (Restatement of Lemma 5.6).

Under the same conditions as Theorem 4.3, let T=T1+𝐖(T1)𝐖F22ηϵ=T1+O~(m3η1ϵ1𝛍22)T=T_{1}+\Big{\lfloor}\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{2\eta\epsilon}\Big{\rfloor}=T_{1}+\widetilde{O}(m^{3}\eta^{-1}\epsilon^{-1}\|\bm{\mu}\|_{2}^{-2}). Then we have maxj,r,i|ρj,r,i(t)|2β^=σ0σpd\max_{j,r,i}|\rho_{j,r,i}^{(t)}|\leq 2\widehat{\beta}=\sigma_{0}\sigma_{p}\sqrt{d} for all T1tTT_{1}\leq t\leq T. Besides,

1tT1+1s=T1tLS(𝐖(s))𝐖(T1)𝐖F2(2q1)η(tT1+1)+ϵ2q1\displaystyle\frac{1}{t-T_{1}+1}\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(t-T_{1}+1)}+\frac{\epsilon}{2q-1}

for all T1tTT_{1}\leq t\leq T, and we can find an iterate with training loss smaller than ϵ\epsilon within TT iterations.

Proof of Lemma D.5.

By Lemma D.4, for any t[T1,T]t\in[T_{1},T], we have that

𝐖(s)𝐖F2𝐖(s+1)𝐖F2(2q1)ηLS(𝐖(s))ηϵ\displaystyle\|\mathbf{W}^{(s)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(s+1)}-\mathbf{W}^{*}\|_{F}^{2}\geq(2q-1)\eta L_{S}(\mathbf{W}^{(s)})-\eta\epsilon

holds for sts\leq t. Taking a summation, we obtain that

s=T1tLS(𝐖(s))\displaystyle\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)}) 𝐖(T1)𝐖F2+ηϵ(tT1+1)(2q1)η\displaystyle\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}+\eta\epsilon(t-T_{1}+1)}{(2q-1)\eta} (D.7)

for all T1tTT_{1}\leq t\leq T. Dividing (tT1+1)(t-T_{1}+1) on both side of (D.7) gives that

1tT1+1s=T1tLS(𝐖(s))𝐖(T1)𝐖F2(2q1)η(tT1+1)+ϵ2q1.\displaystyle\frac{1}{t-T_{1}+1}\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(t-T_{1}+1)}+\frac{\epsilon}{2q-1}.

Then we can take t=Tt=T and have that

1TT1+1s=T1TLS(𝐖(s))𝐖(T1)𝐖F2(2q1)η(TT1+1)+ϵ2q13ϵ2q1<ϵ,\displaystyle\frac{1}{T-T_{1}+1}\sum_{s=T_{1}}^{T}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(T-T_{1}+1)}+\frac{\epsilon}{2q-1}\leq\frac{3\epsilon}{2q-1}<\epsilon,

where we use the fact that q>2 and our choice that T=T_{1}+\Big{\lfloor}\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{2\eta\epsilon}\Big{\rfloor}. Since the average is smaller than \epsilon, we can conclude that there exists T_{1}\leq t\leq T such that L_{S}(\mathbf{W}^{(t)})<\epsilon.

Finally, we will prove that maxj,r,i|ρj,r,i(t)|2β^\max_{j,r,i}|\rho_{j,r,i}^{(t)}|\leq 2\widehat{\beta} for all t[T1,T]t\in[T_{1},T]. Plugging T=T1+𝐖(T1)𝐖F22ηϵT=T_{1}+\Big{\lfloor}\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{2\eta\epsilon}\Big{\rfloor} into (D.7) gives that

\displaystyle\sum_{s=T_{1}}^{T}L_{S}(\mathbf{W}^{(s)}) \displaystyle\leq\frac{2\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta}=\widetilde{O}(\eta^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2}), (D.8)

where the inequality is due to 𝐖(T1)𝐖FO~(m3/2𝝁21)\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}\leq\widetilde{O}(m^{3/2}\|\bm{\mu}\|_{2}^{-1}) in Lemma D.2. Define Ψ(t)=maxj,r,i|ρj,r,i(t)|\Psi^{(t)}=\max_{j,r,i}|\rho_{j,r,i}^{(t)}|. We will use induction to prove Ψ(t)2β^\Psi^{(t)}\leq 2\widehat{\beta} for all t[T1,T]t\in[T_{1},T]. At t=T1t=T_{1}, by the definition of β^\widehat{\beta}, clearly we have Ψ(T1)β^2β^\Psi^{(T_{1})}\leq\widehat{\beta}\leq 2\widehat{\beta}. Now suppose that there exists T~[T1,T]\widetilde{T}\in[T_{1},T] such that Ψ(t)2β^\Psi^{(t)}\leq 2\widehat{\beta} for all t[T1,T~1]t\in[T_{1},\widetilde{T}-1]. Then for t[T1,T~1]t\in[T_{1},\widetilde{T}-1], by (C.4) and (C.5) we have

Ψ(t+1)\displaystyle\Psi^{(t+1)} Ψ(t)+maxj,r,i{ηnm|i(t)|σ(𝐰j,r(0),𝝃i+2i=1nΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle\leq\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot|\ell_{i}^{\prime(t)}|\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+2\sum_{i^{\prime}=1}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}\bigg{\}}
=Ψ(t)+maxj,r,i{ηnm|i(t)|σ(𝐰j,r(0),𝝃i+2Ψ(t)+2iinΨ(t)|𝝃i,𝝃i|𝝃i22)𝝃i22}\displaystyle=\Psi^{(t)}+\max_{j,r,i}\bigg{\{}\frac{\eta}{nm}\cdot|\ell_{i}^{\prime(t)}|\cdot\sigma^{\prime}\Bigg{(}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+2\Psi^{(t)}+2\sum_{i^{\prime}\neq i}^{n}\Psi^{(t)}\cdot\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}\Bigg{)}\cdot\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}\bigg{\}}
Ψ(t)+ηqnmmaxi|i(t)|[2log(8mn/δ)σ0σpd\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\max_{i}|\ell_{i}^{\prime(t)}|\cdot\Bigg{[}2\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}
+(2+4nσp2dlog(4n2/δ)σp2d/2)Ψ(t)]q12σp2d\displaystyle\qquad+\Bigg{(}2+\frac{4n\sigma_{p}^{2}\cdot\sqrt{d\log(4n^{2}/\delta)}}{\sigma_{p}^{2}d/2}\Bigg{)}\cdot\Psi^{(t)}\Bigg{]}^{q-1}\cdot 2\sigma_{p}^{2}d
Ψ(t)+ηqnmmaxi|i(t)|(2log(8mn/δ)σ0σpd+4Ψ(t))q12σp2d,\displaystyle\leq\Psi^{(t)}+\frac{\eta q}{nm}\cdot\max_{i}|\ell_{i}^{\prime(t)}|\cdot\big{(}2\cdot\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}+4\cdot\Psi^{(t)}\big{)}^{q-1}\cdot 2\sigma_{p}^{2}d,

where the second inequality is due to Lemmas B.2 and B.3, and the last inequality follows by the assumption that d\geq 16n^{2}\log(4n^{2}/\delta). Taking a telescoping sum over t=T_{1},T_{1}+1,\ldots,\widetilde{T}-1, we have that

\displaystyle\Psi^{(\widetilde{T})} \displaystyle\overset{(i)}{\leq}\Psi^{(T_{1})}+\frac{\eta q}{nm}\sum_{s=T_{1}}^{\widetilde{T}-1}\max_{i}|\ell_{i}^{\prime(s)}|\widetilde{O}(\sigma_{p}^{2}d)\widehat{\beta}^{q-1}
(ii)Ψ(T1)+ηqnmO~(σp2d)β^q1s=T1T~1maxii(s)\displaystyle\overset{(ii)}{\leq}\Psi^{(T_{1})}+\frac{\eta q}{nm}\widetilde{O}(\sigma_{p}^{2}d)\widehat{\beta}^{q-1}\sum_{s=T_{1}}^{\widetilde{T}-1}\max_{i}\ell_{i}^{(s)}
(iii)Ψ(T1)+O~(ηm1σp2d)β^q1s=T1T~1LS(𝐖(s))\displaystyle\overset{(iii)}{\leq}\Psi^{(T_{1})}+\widetilde{O}(\eta m^{-1}\sigma_{p}^{2}d)\widehat{\beta}^{q-1}\sum_{s=T_{1}}^{\widetilde{T}-1}L_{S}(\mathbf{W}^{(s)})
(iv)Ψ(T1)+O~(m2SNR2)β^q1\displaystyle\overset{(iv)}{\leq}\Psi^{(T_{1})}+\widetilde{O}(m^{2}\mathrm{SNR}^{-2})\widehat{\beta}^{q-1}
(v)β^+O~(m2n2/qβ^q2)β^\displaystyle\overset{(v)}{\leq}\widehat{\beta}+\widetilde{O}(m^{2}n^{2/q}\widehat{\beta}^{q-2})\widehat{\beta}
(vi)2β^,\displaystyle\overset{(vi)}{\leq}2\widehat{\beta},

where (i) is by our induction hypothesis that \Psi^{(t)}\leq 2\widehat{\beta}, (ii) is by |\ell^{\prime}|\leq\ell, (iii) is by \max_{i}\ell_{i}^{(s)}\leq\sum_{i}\ell_{i}^{(s)}=nL_{S}(\mathbf{W}^{(s)}), (iv) is due to \sum_{s=T_{1}}^{\widetilde{T}-1}L_{S}(\mathbf{W}^{(s)})\leq\sum_{s=T_{1}}^{T}L_{S}(\mathbf{W}^{(s)})=\widetilde{O}(\eta^{-1}m^{3}\|\bm{\mu}\|_{2}^{-2}) in (D.8), (v) is by n\mathrm{SNR}^{q}\geq\widetilde{\Omega}(1), and (vi) is by the definition that \widehat{\beta}=\sigma_{0}\sigma_{p}\sqrt{d}/2 and \widetilde{O}(m^{2}n^{2/q}\widehat{\beta}^{q-2})=\widetilde{O}(m^{2}n^{2/q}(\sigma_{0}\sigma_{p}\sqrt{d})^{q-2})\leq 1 by Condition 4.2. Therefore, \Psi^{(\widetilde{T})}\leq 2\widehat{\beta}, which completes the induction. ∎

D.3 Population Loss

Consider a new data point (𝐱,y)(\mathbf{x},y) drawn from the distribution defined in Definition 3.1. Without loss of generality, we suppose that the first patch is the signal patch and the second patch is the noise patch, i.e., 𝐱=[y𝝁,𝝃]\mathbf{x}=[y\bm{\mu},\bm{\xi}]. Moreover, by the signal-noise decomposition, the learned neural network has parameter

𝐰j,r(t)\displaystyle\mathbf{w}_{j,r}^{(t)} =𝐰j,r(0)+jγj,r(t)𝝁𝝁22+i=1nρ¯j,r,i(t)𝝃i𝝃i22+i=1nρ¯j,r,i(t)𝝃i𝝃i22\displaystyle=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(t)}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}

for j{±1}j\in\{\pm 1\} and r[m]r\in[m].

Lemma D.6.

Under the same conditions as Theorem 4.3, we have that maxj,r|𝐰j,r(t),𝛏i|1/2\max_{j,r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle|\leq 1/2 for all 0tT0\leq t\leq T.

Proof of Lemma D.6.

We can get the upper bound of the inner products between the parameter and the noise as follows:

|𝐰j,r(t),𝝃i|\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle| (i)|𝐰j,r(0),𝝃i|+|ρ¯j,r,i(t)|+|ρ¯j,r,i(t)|+8nlog(4n2/δ)dα\displaystyle\overset{(i)}{\leq}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|+|\underline{\rho}_{j,r,i}^{(t)}|+|\overline{\rho}_{j,r,i}^{(t)}|+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
(ii)2log(8mn/δ)σ0σpd+σ0σpd+8nlog(4n2/δ)dα\displaystyle\overset{(ii)}{\leq}2\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}+\sigma_{0}\sigma_{p}\sqrt{d}+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
(iii)1/2\displaystyle\overset{(iii)}{\leq}1/2

for all j\in\{\pm 1\}, r\in[m] and i\in[n], where (i) is by Lemma C.4, (ii) is due to |\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|\leq 2\sqrt{\log(8mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d} in Lemma B.3 and \max_{j,r,i}|\rho_{j,r,i}^{(t)}|\leq\sigma_{0}\sigma_{p}\sqrt{d} in Lemma D.5, and (iii) is due to our conditions \sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-1})\cdot(\sigma_{p}\sqrt{d})^{-1} and d\geq\widetilde{\Omega}(m^{2}n^{4}) in Condition 4.2. ∎

Lemma D.7.

Under the same conditions as Theorem 4.3, with probability at least 14mTexp(C21σ02σp2d1)1-4mT\cdot\exp(-C_{2}^{-1}\sigma_{0}^{-2}\sigma_{p}^{-2}d^{-1}), we have that maxj,r|𝐰j,r(t),𝛏|1/2\max_{j,r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle|\leq 1/2 for all 0tT0\leq t\leq T, where C2=O~(1)C_{2}=\widetilde{O}(1).

Proof of Lemma D.7.

Let 𝐰~j,r(t)=𝐰j,r(t)jγj,r(t)𝝁𝝁22\widetilde{\mathbf{w}}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(t)}-j\cdot\gamma_{j,r}^{(t)}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}, then we have that 𝐰~j,r(t),𝝃=𝐰j,r(t),𝝃\langle\widetilde{\mathbf{w}}_{j,r}^{(t)},\bm{\xi}\rangle=\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle and

𝐰~j,r(t)2O~(σ0d+nσ0)=O~(σ0d),\displaystyle\|\widetilde{\mathbf{w}}_{j,r}^{(t)}\|_{2}\leq\widetilde{O}(\sigma_{0}\sqrt{d}+n\sigma_{0})=\widetilde{O}(\sigma_{0}\sqrt{d}), (D.9)

where the equality is due to dΩ~(m2n4)d\geq\widetilde{\Omega}(m^{2}n^{4}) by Condition 4.2.

By (D.9), \max_{j,r}\|\widetilde{\mathbf{w}}_{j,r}^{(t)}\|_{2}\leq C_{1}\sigma_{0}\sqrt{d}, where C_{1}=\widetilde{O}(1). Clearly, \langle\widetilde{\mathbf{w}}_{j,r}^{(t)},\bm{\xi}\rangle follows a Gaussian distribution with mean zero and standard deviation at most C_{1}\sigma_{0}\sigma_{p}\sqrt{d}. Therefore, the tail probability is bounded by

(|𝐰~j,r(t),𝝃|1/2)2exp(18C12σ02σp2d).\displaystyle\mathbb{P}\big{(}|\langle\widetilde{\mathbf{w}}_{j,r}^{(t)},\bm{\xi}\rangle|\geq 1/2\big{)}\leq 2\exp\bigg{(}-\frac{1}{8C_{1}^{2}\sigma_{0}^{2}\sigma_{p}^{2}d}\bigg{)}.

Applying a union bound over j,r,tj,r,t completes the proof. ∎
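
The union bound above rests on the standard Gaussian tail inequality \mathbb{P}(|Z|\geq 1/2)\leq 2\exp(-1/(8s^{2})) for Z\sim\mathcal{N}(0,s^{2}). The sketch below (illustration only; the values of s, standing in for C_{1}\sigma_{0}\sigma_{p}\sqrt{d}, are assumptions) compares the exact tail with this bound.

```python
import math

# Sanity check (illustration only) of the Gaussian tail bound behind Lemma D.7:
# for Z ~ N(0, s^2), P(|Z| >= 1/2) = erfc(1/(2*s*sqrt(2))) <= 2*exp(-1/(8*s^2)).
# The values of s (standing in for C_1*sigma_0*sigma_p*sqrt(d)) are assumptions.
for s in (0.05, 0.1, 0.2):
    exact = math.erfc(0.5 / (s * math.sqrt(2)))   # exact two-sided tail at threshold 1/2
    bound = 2 * math.exp(-1.0 / (8 * s * s))
    print(f"s = {s}: tail = {exact:.3e} <= bound = {bound:.3e}")
    assert exact <= bound
```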

Lemma D.8 (Restatement of Lemma 5.7).

Let T be defined as in Lemma 5.5. Under the same conditions as Theorem 4.3, for any 0\leq t\leq T with L_{S}(\mathbf{W}^{(t)})\leq 1, it holds that L_{\mathcal{D}}(\mathbf{W}^{(t)})\leq 6\cdot L_{S}(\mathbf{W}^{(t)})+\exp(-n^{2}).

Proof of Lemma D.8.

Let \mathcal{E} be the event that the conclusion of Lemma D.7 holds. Then we can divide L_{\mathcal{D}}(\mathbf{W}^{(t)}) into two parts:

𝔼[(yf(𝐖(t),𝐱))]\displaystyle\mathbb{E}\big{[}\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)}\big{]} =𝔼[𝟙()(yf(𝐖(t),𝐱))]I1+𝔼[𝟙(c)(yf(𝐖(t),𝐱))]I2.\displaystyle=\underbrace{\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E})\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)}]}_{I_{1}}+\underbrace{\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E}^{c})\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)}]}_{I_{2}}. (D.10)

In the following, we bound I1I_{1} and I2I_{2} respectively.

Bounding I1I_{1}: Since LS(𝐖(t))1L_{S}(\mathbf{W}^{(t)})\leq 1, there must exist one (𝐱i,yi)(\mathbf{x}_{i},y_{i}) such that (yif(𝐖(t),𝐱i))LS(𝐖(t))1\ell\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\leq L_{S}(\mathbf{W}^{(t)})\leq 1, which implies that yif(𝐖(t),𝐱i)0y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\geq 0. Therefore, we have that

exp(yif(𝐖(t),𝐱i))(i)2log(1+exp(yif(𝐖(t),𝐱i)))=2(yif(𝐖(t),𝐱i))2LS(𝐖(t)),\displaystyle\exp(-y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i}))\overset{(i)}{\leq}2\log\big{(}1+\exp(-y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i}))\big{)}=2\ell\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}\leq 2L_{S}(\mathbf{W}^{(t)}), (D.11)

where (i) is by z\leq 2\log(1+z) for all z\in[0,1]. If the event \mathcal{E} holds, we have that

|yf(𝐖(t),𝐱)yif(𝐖(t),𝐱i)|\displaystyle|yf(\mathbf{W}^{(t)},\mathbf{x})-y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})| 1mj,rσ(𝐰j,r,𝝃i)+1mj,rσ(𝐰j,r,𝝃)\displaystyle\leq\frac{1}{m}\sum_{j,r}\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}_{i}\rangle)+\frac{1}{m}\sum_{j,r}\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}\rangle)
1mj,rσ(1/2)+1mj,rσ(1/2)\displaystyle\leq\frac{1}{m}\sum_{j,r}\sigma(1/2)+\frac{1}{m}\sum_{j,r}\sigma(1/2)
1,\displaystyle\leq 1, (D.12)

where the second inequality is by maxj,r|𝐰j,r(t),𝝃|1/2\max_{j,r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle|\leq 1/2 in Lemma D.7 and maxj,r|𝐰j,r(t),𝝃i|1/2\max_{j,r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle|\leq 1/2 in Lemma D.6. Thus we have that

I1\displaystyle I_{1} 𝔼[𝟙()exp(yf(𝐖(t),𝐱))]\displaystyle\leq\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E})\exp(-yf(\mathbf{W}^{(t)},\mathbf{x}))]
e𝔼[𝟙()exp(yif(𝐖(t),𝐱i))]\displaystyle\leq e\cdot\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E})\exp(-y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i}))]
2e𝔼[𝟙()LS(𝐖(t))],\displaystyle\leq 2e\cdot\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E})L_{S}(\mathbf{W}^{(t)})],

where the first inequality is by the property of cross-entropy loss that (z)exp(z)\ell(z)\leq\exp(-z) for all zz, the second inequality is by (D.12), and the third inequality is by (D.11). Dropping the event in the expectation gives I16LS(𝐖(t))I_{1}\leq 6L_{S}(\mathbf{W}^{(t)}).
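
The bound on I_{1} uses two elementary facts about the cross-entropy loss; the sketch below (illustration only) verifies them on a grid.

```python
import math

# Grid check (illustration only) of the two elementary facts used above:
#   (a) ell(z) = log(1 + exp(-z)) <= exp(-z) for all z, and
#   (b) z <= 2 * log(1 + z) for z in [0, 1], which is inequality (i) in (D.11).
for k in range(-50, 51):
    z = k / 10.0
    assert math.log(1 + math.exp(-z)) <= math.exp(-z) + 1e-12   # (a)
for k in range(101):
    z = k / 100.0
    assert z <= 2 * math.log(1 + z) + 1e-12                     # (b)
print("both elementary inequalities hold on the tested grids")
```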

Bounding I2I_{2}: Next we bound the second term I2I_{2}. We choose an arbitrary training data (𝐱i,yi)(\mathbf{x}_{i^{\prime}},y_{i^{\prime}}) such that yi=yy_{i^{\prime}}=y. Then we have

(yf(𝐖(t),𝐱))\displaystyle\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)} log(1+exp(Fy(𝐖(t),𝐱)))\displaystyle\leq\log(1+\exp(F_{-y}(\mathbf{W}^{(t)},\mathbf{x})))
1+Fy(𝐖(t),𝐱)\displaystyle\leq 1+F_{-y}(\mathbf{W}^{(t)},\mathbf{x})
=1+1mj=y,r[m]σ(𝐰j,r(t),y𝝁)+1mj=y,r[m]σ(𝐰j,r(t),𝝃)\displaystyle=1+\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},y\bm{\mu}\rangle)+\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle)
\displaystyle\leq 1+F_{-y_{i^{\prime}}}(\mathbf{W}_{-y_{i^{\prime}}},\mathbf{x}_{i^{\prime}})+\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle)
2+1mj=y,r[m]σ(𝐰j,r(t),𝝃)\displaystyle\leq 2+\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle)
2+O~((σ0d)q)𝝃q,\displaystyle\leq 2+\widetilde{O}((\sigma_{0}\sqrt{d})^{q})\|\bm{\xi}\|^{q}, (D.13)

where the first inequality is due to Fy(𝐖(t),𝐱)0F_{y}(\mathbf{W}^{(t)},\mathbf{x})\geq 0, the second inequality is by the property of cross-entropy loss, i.e., log(1+exp(z))1+z\log(1+\exp(z))\leq 1+z for all z0z\geq 0, the third inequality is by 1mj=y,r[m]σ(𝐰j,r(t),y𝝁)Fy(𝐖y,𝐱i)=Fyi(𝐖yi,𝐱i)\frac{1}{m}\sum_{j=-y,r\in[m]}\sigma(\langle\mathbf{w}_{j,r}^{(t)},y\bm{\mu}\rangle)\leq F_{-y}(\mathbf{W}_{-y},\mathbf{x}_{i^{\prime}})=F_{-y_{i^{\prime}}}(\mathbf{W}_{-y_{i^{\prime}}},\mathbf{x}_{i^{\prime}}), the fourth inequality is by Fyi(𝐖yi,𝐱i)1F_{-y_{i^{\prime}}}(\mathbf{W}_{-y_{i^{\prime}}},\mathbf{x}_{i^{\prime}})\leq 1 in Lemma C.5, and the last inequality is due to 𝐰~j,r(t),𝝃=𝐰j,r(t),𝝃𝐰~j,r(t)2𝝃2O~(σ0d)𝝃2\langle\widetilde{\mathbf{w}}_{j,r}^{(t)},\bm{\xi}\rangle=\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle\leq\|\widetilde{\mathbf{w}}_{j,r}^{(t)}\|_{2}\|\bm{\xi}\|_{2}\leq\widetilde{O}(\sigma_{0}\sqrt{d})\|\bm{\xi}\|_{2} in (D.9). Then we further have that

I2\displaystyle I_{2} 𝔼[𝟙(c)]𝔼[(yf(𝐖(t),𝐱))2]\displaystyle\leq\sqrt{\mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E}^{c})]}\cdot\sqrt{\mathbb{E}\Big{[}\ell\big{(}yf(\mathbf{W}^{(t)},\mathbf{x})\big{)}^{2}\Big{]}}
(c)4+O~((σ0d)2q)𝔼[𝝃22q]\displaystyle\leq\sqrt{\mathbb{P}(\mathcal{E}^{c})}\cdot\sqrt{4+\widetilde{O}((\sigma_{0}\sqrt{d})^{2q})\mathbb{E}[\|\bm{\xi}\|_{2}^{2q}]}
exp[Ω~(σ02σp2d1)+polylog(d)]\displaystyle\leq\exp[-\widetilde{\Omega}(\sigma_{0}^{-2}\sigma_{p}^{-2}d^{-1})+\text{polylog}(d)]
exp(n2),\displaystyle\leq\exp(-n^{2}),

where the first inequality is by the Cauchy-Schwarz inequality, the second inequality is by (D.13), the third inequality is by Lemma D.7 and the fact that \sqrt{4+\widetilde{O}((\sigma_{0}\sqrt{d})^{2q})\mathbb{E}[\|\bm{\xi}\|_{2}^{2q}]}=O(\mathrm{poly}(d)), and the last inequality is by our condition \sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-1})\cdot(\sigma_{p}\sqrt{d})^{-1} in Condition 4.2. Plugging the bounds of I_{1}, I_{2} into (D.10) completes the proof. ∎
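
The bound on I_{2} starts from the Cauchy-Schwarz step \mathbb{E}[\operatorname{\mathds{1}}(\mathcal{E}^{c})X]\leq\sqrt{\mathbb{P}(\mathcal{E}^{c})}\sqrt{\mathbb{E}[X^{2}]}. A small Monte Carlo illustration (with a synthetic heavy-tailed surrogate for the loss and an artificial rare event; not part of the proof) is given below.

```python
import numpy as np

# Monte Carlo illustration (not part of the proof) of the Cauchy-Schwarz step bounding I_2:
# for a nonnegative variable X and an event E^c, E[1(E^c) X] <= sqrt(P(E^c)) * sqrt(E[X^2]).
# The lognormal X and the 1% event below are synthetic stand-ins for the loss and E^c.
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)
rare = rng.random(200_000) < 0.01
lhs = np.mean(x * rare)
rhs = np.sqrt(np.mean(rare)) * np.sqrt(np.mean(x ** 2))
print(f"lhs = {lhs:.4f} <= rhs = {rhs:.4f}")
assert lhs <= rhs
```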

Appendix E Noise Memorization

In this section, we will consider the noise memorization case under the condition that σpq(d)qΩ~(n𝝁2q)\sigma_{p}^{q}(\sqrt{d})^{q}\geq\widetilde{\Omega}(n\|\bm{\mu}\|_{2}^{q}). We remind the readers that the proofs in this section are based on the results in Section B, which hold with high probability.

We also remind readers that α=4log(T)\alpha=4\log(T^{*}) is defined in Appendix C. Denote β¯=minimaxr𝐰yi,r(0),𝝃i\bar{\beta}=\min_{i}\max_{r}\langle\mathbf{w}_{y_{i},r}^{(0)},\bm{\xi}_{i}\rangle. The following lemma provides a lower bound of β¯\bar{\beta}.

Lemma E.1.

Under the same conditions as Theorem 4.4, if in particular

σ080nlog(4n2/δ)dαmin{(σpd)1,𝝁21},\displaystyle\sigma_{0}\geq 80n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\}, (E.1)

then we have that β¯σ0σpd/420nlog(4n2/δ)dα\bar{\beta}\geq\sigma_{0}\sigma_{p}\sqrt{d}/4\geq 20n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha.

Proof of Lemma E.1.

Because σpq(d)qΩ~(n𝝁2q)\sigma_{p}^{q}(\sqrt{d})^{q}\geq\widetilde{\Omega}(n\|\bm{\mu}\|_{2}^{q}), we have that σpd𝝁2\sigma_{p}\sqrt{d}\geq\|\bm{\mu}\|_{2}. Therefore we have that

β¯\displaystyle\bar{\beta} σ0σpd/4\displaystyle\geq\sigma_{0}\sigma_{p}\sqrt{d}/4
=σ0/4max{σpd,𝝁2}\displaystyle=\sigma_{0}/4\cdot\max\{\sigma_{p}\sqrt{d},\|\bm{\mu}\|_{2}\}
20nlog(4n2/δ)dα,\displaystyle\geq 20n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha,

where the first inequality is by Lemma B.3 and the last inequality is by our lower bound condition of σ0\sigma_{0} in (E.1). ∎
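
A quick plug-in of made-up numbers (not from the paper) into this chain, assuming \sigma_{p}\sqrt{d}\geq\|\bm{\mu}\|_{2} and taking \sigma_{0} above the lower bound (E.1):

```python
import math

# Plugging made-up numbers (not from the paper) into the chain above: with
# sigma_p*sqrt(d) >= ||mu||_2 and sigma_0 above the lower bound (E.1), the quantity
# sigma_0*sigma_p*sqrt(d)/4 indeed dominates 20*n*sqrt(log(4n^2/delta)/d)*alpha.
n, d, delta, alpha = 20, 10**11, 0.01, 40.0
sigma_p, mu_norm = 1.0, 100.0
assert sigma_p * math.sqrt(d) >= mu_norm          # implied by the SNR condition of Appendix E

noise_scale = math.sqrt(math.log(4 * n**2 / delta) / d)
lower_bound = 80 * n * noise_scale * alpha * min(1 / (sigma_p * math.sqrt(d)), 1 / mu_norm)
sigma_0 = 2 * lower_bound                         # any sigma_0 satisfying (E.1)

lhs = sigma_0 * sigma_p * math.sqrt(d) / 4
rhs = 20 * n * noise_scale * alpha
print(f"{lhs:.3f} >= {rhs:.3f}")
assert lhs >= rhs
```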

E.1 First Stage

Lemma E.2.

Under the same conditions as Theorem 4.4, in particular if we choose

n1SNRqC2q+2log(20/(σ0σpd))(2log(8m/δ))q20.15q2,\displaystyle n^{-1}\mathrm{SNR}^{-q}\geq\frac{C2^{q+2}\log\big{(}20/(\sigma_{0}\sigma_{p}\sqrt{d})\big{)}(\sqrt{2\log(8m/\delta)})^{q-2}}{0.15^{q-2}}, (E.2)

where C=O(1)C=O(1) is a positive constant, then there exist

\displaystyle T_{1}=\frac{C\log\big(20/(\sigma_{0}\sigma_{p}\sqrt{d})\big)4mn}{0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}

such that

  • maxj,rρ¯j,r,i(T1)2\max_{j,r}\overline{\rho}_{j,r,i}^{(T_{1})}\geq 2 for all i[n]i\in[n].

  • maxj,rγj,r(t)=O~(σ0𝝁2)\max_{j,r}\gamma_{j,r}^{(t)}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}) for all 0tT10\leq t\leq T_{1}.

  • maxj,r,i|ρ¯j,r,i(t)|=O~(σ0σpd)\max_{j,r,i}|\underline{\rho}_{j,r,i}^{(t)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}) for all 0tT10\leq t\leq T_{1}.

Proof of Lemma E.2.

Let

T1+=mηq2q1(2log(8m/δ))q2σ0q2𝝁2q.\displaystyle T_{1}^{+}=\frac{m}{\eta q2^{q-1}(\sqrt{2\log(8m/\delta)})^{q-2}\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}. (E.3)

By Proposition C.2, we have that ρ¯j,r,i(t)β16nlog(4n2/δ)dαββ¯\underline{\rho}_{j,r,i}^{(t)}\geq-\beta-16n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\geq-\beta-\bar{\beta} for all j{±1}j\in\{\pm 1\}, r[m]r\in[m], i[n]i\in[n] and 0tT0\leq t\leq T^{*}. Since ρ¯j,r,i(t)0\underline{\rho}_{j,r,i}^{(t)}\leq 0 and β¯β=O~(σ0σpd)\bar{\beta}\leq\beta=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}), we have that maxj,r,i|ρ¯j,r,i(t)|=O~(σ0σpd)\max_{j,r,i}|\underline{\rho}_{j,r,i}^{(t)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}). Next, we will carefully compute the growth of the γj,r(t)\gamma_{j,r}^{(t)}.

γj,r(t+1)\displaystyle\gamma_{j,r}^{(t+1)} =γj,r(t)ηnmi=1ni(t)σ(𝐰j,r(t),yi𝝁)𝝁22\displaystyle=\gamma_{j,r}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\|\bm{\mu}\|_{2}^{2}
γj,r(t)+ηnmi=1nσ(|𝐰j,r(0),𝝁|+γj,r(t))𝝁22,\displaystyle\leq\gamma_{j,r}^{(t)}+\frac{\eta}{nm}\cdot\sum_{i=1}^{n}\sigma^{\prime}(|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|+\gamma_{j,r}^{(t)})\|\bm{\mu}\|_{2}^{2},

where the inequality is by ||1|\ell^{\prime}|\leq 1. Let A(t)=maxj,r{γj,r(t)+|𝐰j,r(0),𝝁|}A^{(t)}=\max_{j,r}\{\gamma_{j,r}^{(t)}+|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|\}, then we have that

A(t+1)A(t)+ηq𝝁22m[A(t)]q1.\displaystyle A^{(t+1)}\leq A^{(t)}+\frac{\eta q\|\bm{\mu}\|_{2}^{2}}{m}[A^{(t)}]^{q-1}. (E.4)

We will use induction to prove that A(t)2A(0)A^{(t)}\leq 2A^{(0)} for tT1+t\leq T_{1}^{+}. By definition, clearly we have that A(0)2A(0)A^{(0)}\leq 2A^{(0)}. Now suppose that there exists some T~T1+\widetilde{T}\leq T_{1}^{+} such that A(t)2A(0)A^{(t)}\leq 2A^{(0)} holds for 0tT~10\leq t\leq\widetilde{T}-1. Taking a telescoping sum of (E.4) gives that

\displaystyle A^{(\widetilde{T})}\leq A^{(0)}+\sum_{s=0}^{\widetilde{T}-1}\frac{\eta q\|\bm{\mu}\|_{2}^{2}}{m}[A^{(s)}]^{q-1}
A(0)+ηq𝝁22T1+2q1m[A(0)]q1\displaystyle\leq A^{(0)}+\frac{\eta q\|\bm{\mu}\|_{2}^{2}T_{1}^{+}2^{q-1}}{m}[A^{(0)}]^{q-1}
A(0)+ηq𝝁22T1+2q1m[2log(8m/δ)σ0𝝁2]q2A(0)\displaystyle\leq A^{(0)}+\frac{\eta q\|\bm{\mu}\|_{2}^{2}T_{1}^{+}2^{q-1}}{m}[\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2}]^{q-2}A^{(0)}
2A(0),\displaystyle\leq 2A^{(0)},

where the second inequality is by our induction hypothesis, the third inequality is by A02log(8m/δ)σ0𝝁2A_{0}\leq\sqrt{2\log(8m/\delta)}\cdot\sigma_{0}\|\bm{\mu}\|_{2} in Lemma B.3, and the last inequality is by (E.3). Thus we have that A(t)2A(0)A^{(t)}\leq 2A^{(0)} for all tT1+t\leq T_{1}^{+}. Therefore, maxj,rγj,r(t)A(t)+maxj,r{|𝐰j,r(0),𝝁|}3A(0)\max_{j,r}\gamma_{j,r}^{(t)}\leq A^{(t)}+\max_{j,r}\{|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|\}\leq 3A^{(0)} for all 0tT1+0\leq t\leq T_{1}^{+}. Recall that

ρ¯j,r,i(t+1)\displaystyle\overline{\rho}_{j,r,i}^{(t+1)} =ρ¯j,r,i(t)ηnmi(t)σ(𝐰j,r(t),𝝃i)𝟙(yi=j)𝝃i22.\displaystyle=\overline{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}^{(t)}_{j,r},\bm{\xi}_{i}\rangle)\cdot\operatorname{\mathds{1}}(y_{i}=j)\|\bm{\xi}_{i}\|_{2}^{2}.

For yi=jy_{i}=j, Lemma C.4 implies that

𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+ρ¯j,r,i(t)8nlog(4n2/δ)dα\displaystyle\geq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha
ρ¯j,r,i(t)+𝐰j,r(0),𝝃i0.4β¯,\displaystyle\geq\overline{\rho}^{(t)}_{j,r,i}+\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle-0.4\bar{\beta},

where the last inequality is by β¯20nlog(4n2/δ)dα\bar{\beta}\geq 20n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha. Now let Bi(t)=maxj=yi,r{ρ¯j,r,i(t)+𝐰j,r(0),𝝃i0.4β¯}B_{i}^{(t)}=\max_{j=y_{i},r}\{\overline{\rho}_{j,r,i}^{(t)}+\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle-0.4\bar{\beta}\}. For each ii, denote by T1(i)T_{1}^{(i)} the last time in the period [0,T1+][0,T_{1}^{+}] satisfying that ρ¯j,r,i(t)2\overline{\rho}_{j,r,i}^{(t)}\leq 2. Then for tT1(i)t\leq T_{1}^{(i)}, maxj,r{|ρ¯j,r,i(t)|,|ρ¯j,r,i(t)|}=O(1)\max_{j,r}\{|\overline{\rho}_{j,r,i}^{(t)}|,|\underline{\rho}_{j,r,i}^{(t)}|\}=O(1) and maxj,rγj,r(t)3A(0)=O(1)\max_{j,r}\gamma_{j,r}^{(t)}\leq 3A^{(0)}=O(1). Therefore, by Lemmas C.5 and C.6, we know that F1(𝐖(t),𝐱i),F+1(𝐖(t),𝐱i)=O(1)F_{-1}(\mathbf{W}^{(t)},\mathbf{x}_{i}),F_{+1}(\mathbf{W}^{(t)},\mathbf{x}_{i})=O(1). Thus there exists a positive constant C1C_{1} such that i(t)C1-\ell^{\prime(t)}_{i}\geq C_{1} for all 0tT1(i)0\leq t\leq T_{1}^{(i)}. It is also easy to check that Bi(0)0.6β¯0.15σ0σpdB_{i}^{(0)}\geq 0.6\bar{\beta}\geq 0.15\sigma_{0}\sigma_{p}\sqrt{d}. Then we can carefully compute the growth of Bi(t)B_{i}^{(t)},

Bi(t+1)\displaystyle B_{i}^{(t+1)} Bi(t)+C1ηqσp2d2nm[Bi(t)]q1\displaystyle\geq B_{i}^{(t)}+\frac{C_{1}\eta q\sigma_{p}^{2}d}{2nm}[B_{i}^{(t)}]^{q-1}
Bi(t)+C1ηqσp2d2nm[Bi(0)]q2Bi(t)\displaystyle\geq B_{i}^{(t)}+\frac{C_{1}\eta q\sigma_{p}^{2}d}{2nm}[B_{i}^{(0)}]^{q-2}B_{i}^{(t)}
\displaystyle\geq\bigg(1+\frac{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}{2nm}\bigg)B_{i}^{(t)},

where the second inequality is by the non-decreasing property of Bi(t)B_{i}^{(t)}. Therefore, Bi(t)B_{i}^{(t)} is an exponentially increasing sequence and we have that

\displaystyle B_{i}^{(t)}\geq\bigg(1+\frac{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}{2nm}\bigg)^{t}B_{i}^{(0)}
\displaystyle\geq\exp\bigg(\frac{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}{4nm}t\bigg)B_{i}^{(0)}
\displaystyle\geq\exp\bigg(\frac{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}{4nm}t\bigg)\cdot 0.15\sigma_{0}\sigma_{p}\sqrt{d},

where the second inequality is due to the fact that 1+z\geq\exp(z/2) for z\leq 2 and our conditions on \eta and \sigma_{0} in Condition 4.2, and the last inequality is due to B_{i}^{(0)}\geq 0.15\sigma_{0}\sigma_{p}\sqrt{d}. Therefore, B_{i}^{(t)} will reach 3 within T_{1}=\frac{\log\big(20/(\sigma_{0}\sigma_{p}\sqrt{d})\big)4mn}{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}} iterations. Since \max_{j=y_{i},r}\overline{\rho}_{j,r,i}^{(t)}\geq B_{i}^{(t)}-\max_{j=y_{i},r}|\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle|+0.4\bar{\beta}\geq B_{i}^{(t)}-1, \max_{j=y_{i},r}\overline{\rho}_{j,r,i}^{(t)} will reach 2 within T_{1} iterations. We can next verify that

\displaystyle T_{1}=\frac{\log\big(20/(\sigma_{0}\sigma_{p}\sqrt{d})\big)4mn}{C_{1}0.15^{q-2}\eta q\sigma_{0}^{q-2}(\sigma_{p}\sqrt{d})^{q}}\leq\frac{m}{\eta q2^{q}(\sqrt{2\log(8m/\delta)})^{q-2}\sigma_{0}^{q-2}\|\bm{\mu}\|_{2}^{q}}=T_{1}^{+}/2,

where the inequality holds due to our SNR condition in (E.2). Therefore, by the definition of T1(i)T_{1}^{(i)}, we have T1(i)T1T1+/2T_{1}^{(i)}\leq T_{1}\leq T_{1}^{+}/2, where we use the non-decreasing property of ρ¯j,r,i\overline{\rho}_{j,r,i}. This completes the proof. ∎
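
The proof above controls two scalar recursions: the signal coefficients grow slowly and stay below 2A^{(0)} up to T_{1}^{+}, while the noise coefficients grow geometrically and hit a constant threshold within T_{1} iterations. The simulation below (illustration only; the constants stand in for \eta q\|\bm{\mu}\|_{2}^{2}/m and C_{1}\eta q\sigma_{p}^{2}d/(2nm) and are made up) reproduces both behaviors for q=3.

```python
import math

# Illustrative simulation (made-up constants, not from the paper) of the two scalar
# recursions controlled in the proof above, with q = 3:
#   (1) signal growth A_{t+1} = A_t + cA * A_t^(q-1) stays below 2*A_0 for
#       t <= 1 / (cA * 2^(q-1) * A_0^(q-2)), mirroring the bound up to T_1^+;
#   (2) noise growth B_{t+1} = B_t + cB * B_t^(q-1) reaches the threshold within
#       2 * log(thr / B_0) / (cB * B_0^(q-2)) iterations, mirroring the bound on T_1.
q = 3
cA, A0 = 1e-4, 0.1             # cA stands in for eta*q*||mu||_2^2 / m
cB, B0, thr = 1e-2, 0.05, 3.0  # cB stands in for C_1*eta*q*sigma_p^2*d / (2nm)

horizon = int(1 / (cA * 2 ** (q - 1) * A0 ** (q - 2)))
A = A0
for _ in range(horizon):
    A += cA * A ** (q - 1)
assert A <= 2 * A0
print(f"A after {horizon} steps: {A:.4f} <= {2 * A0}")

hitting_bound = 2 * math.log(thr / B0) / (cB * B0 ** (q - 2))
B, t = B0, 0
while B < thr:
    B += cB * B ** (q - 1)
    t += 1
assert t <= hitting_bound
print(f"B reaches {thr} at t = {t} <= predicted bound {hitting_bound:.0f}")
```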

E.2 Second Stage

By the signal-noise decomposition, at the end of the first stage, we have

𝐰j,r(T1)\displaystyle\mathbf{w}_{j,r}^{(T_{1})} =𝐰j,r(0)+jγj,r(T1)𝝁𝝁22+i=1nρ¯j,r,i(T1)𝝃i𝝃i22+i=1nρ¯j,r,i(T1)𝝃i𝝃i22\displaystyle=\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(T_{1})}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}

for j\in\{\pm 1\} and r\in[m]. By the results obtained in the first stage, we know that at the beginning of this stage the following properties hold:

  • maxrρ¯yi,r,i(T1)2\max_{r}\overline{\rho}_{y_{i},r,i}^{(T_{1})}\geq 2 for all i[n]i\in[n].

  • maxj,r,i|ρ¯j,r,i(T1)|=O~(σ0σpd)\max_{j,r,i}|\underline{\rho}_{j,r,i}^{(T_{1})}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}).

  • maxj,rγj,r(T1)β^\max_{j,r}\gamma_{j,r}^{(T_{1})}\leq\widehat{\beta}^{\prime}, where β^=O~(σ0𝝁2)\widehat{\beta}^{\prime}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}).

Note that Lemma 5.1 implies that the learned noise coefficient \overline{\rho}_{j,r,i}^{(t)} is non-decreasing, i.e., \overline{\rho}_{j,r,i}^{(t+1)}\geq\overline{\rho}_{j,r,i}^{(t)}. Therefore, for every data index i, we have \max_{r}\overline{\rho}_{y_{i},r,i}^{(t)}\geq 2 for all t\geq T_{1}. Now we choose \mathbf{W}^{*} as follows:

\displaystyle\mathbf{w}^{*}_{j,r}=\mathbf{w}_{j,r}^{(0)}+2qm\log(2q/\epsilon)\bigg[\sum_{i=1}^{n}\operatorname{\mathds{1}}(j=y_{i})\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}\bigg].

Based on the definition of 𝐖\mathbf{W}^{*}, we have the following lemma.

Lemma E.3.

Under the same conditions as Theorem 4.4, we have that \|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}\leq\widetilde{O}(m^{3/2}n^{1/2}\sigma_{p}^{-1}d^{-1/2}).

Proof of Lemma E.3.

We have

𝐖(T1)𝐖F\displaystyle\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F} 𝐖(T1)𝐖(0)F+𝐖(0)𝐖F\displaystyle\leq\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{(0)}\|_{F}+\|\mathbf{W}^{(0)}-\mathbf{W}^{*}\|_{F}
j,rγj,r(T1)𝝁21+O(m)maxj,ri=1nρ¯j,r,i(T1)𝝃i𝝃i22+i=1nρ¯j,r,i(T1)𝝃i𝝃i222\displaystyle\leq\sum_{j,r}\gamma_{j,r}^{(T_{1})}\|\bm{\mu}\|_{2}^{-1}+O(\sqrt{m})\max_{j,r}\bigg{\|}\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}\bigg{\|}_{2}
+O(m3/2n1/2log(1/ϵ)σp1d1/2)\displaystyle\qquad+O(m^{3/2}n^{1/2}\log(1/\epsilon)\sigma_{p}^{-1}d^{-1/2})
O~(m3/2n1/2σp1d1/2),\displaystyle\leq\widetilde{O}(m^{3/2}n^{1/2}\sigma_{p}^{-1}d^{-1/2}),

where the first inequality is by triangle inequality, the second inequality is by our decomposition of 𝐖(T1),𝐖\mathbf{W}^{(T_{1})},\mathbf{W}^{*} and Lemma B.2 (notice that different noises are almost orthogonal), and the last inequality is by Proposition C.2 and Lemma E.2. This completes the proof. ∎
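
The second inequality above relies on the near-orthogonality of the noise vectors from Lemma B.2. The Monte Carlo sketch below (illustration only; for simplicity it treats the noises as independent \mathcal{N}(0,\sigma_{p}^{2}\mathbf{I}_{d}) vectors, which may differ slightly from the exact structure of Definition 3.1) shows that |\langle\bm{\xi}_{i},\bm{\xi}_{i^{\prime}}\rangle|/\|\bm{\xi}_{i^{\prime}}\|_{2}^{2} is of order d^{-1/2}.

```python
import numpy as np

# Monte Carlo sketch (illustration only) of the near-orthogonality of noise vectors used
# above (Lemma B.2). For simplicity the noises are drawn as independent N(0, sigma_p^2 I_d)
# vectors, which ignores the exact structure of Definition 3.1; the ratio
# |<xi_i, xi_i'>| / ||xi_i'||_2^2 concentrates at the order d^(-1/2).
rng = np.random.default_rng(0)
d, sigma_p, trials = 4000, 1.0, 200
ratios = []
for _ in range(trials):
    xi_a = rng.normal(0.0, sigma_p, d)
    xi_b = rng.normal(0.0, sigma_p, d)
    ratios.append(abs(xi_a @ xi_b) / (xi_b @ xi_b))
print(f"mean ratio = {np.mean(ratios):.4f}, 1/sqrt(d) = {1 / np.sqrt(d):.4f}")
```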

Lemma E.4.

Under the same conditions as Theorem 4.4, we have that

yif(𝐖(t),𝐱i),𝐖qlog(2q/ϵ)\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle\geq q\log(2q/\epsilon)

for all T1tTT_{1}\leq t\leq T^{*}.

Proof of Lemma E.4.

Recall that f(𝐖(t),𝐱i)=(1/m)j,rj[σ(𝐰j,r,yi𝝁)+σ(𝐰j,r,𝝃i)]f(\mathbf{W}^{(t)},\mathbf{x}_{i})=(1/m){\sum_{j,r}}j\cdot\big{[}\sigma(\langle\mathbf{w}_{j,r},y_{i}\cdot\bm{\mu}\rangle)+\sigma(\langle\mathbf{w}_{j,r},\bm{\xi}_{i}\rangle)\big{]}, so we have

yif(𝐖(t),𝐱i),𝐖\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle
=1mj,rσ(𝐰j,r(t),yi𝝁)𝝁,j𝐰j,r+1mj,rσ(𝐰j,r(t),𝝃i)yi𝝃i,j𝐰j,r\displaystyle=\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\langle\bm{\mu},j\mathbf{w}_{j,r}^{*}\rangle+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\langle y_{i}\bm{\xi}_{i},j\mathbf{w}_{j,r}^{*}\rangle
\displaystyle=\frac{1}{m}\sum_{j,r}\sum_{i^{\prime}=1}^{n}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)2qm\log(2q/\epsilon)\operatorname{\mathds{1}}(j=y_{i^{\prime}})\cdot\frac{\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}
+1mj,rσ(𝐰j,r(t),yi𝝁)𝝁,j𝐰j,r(0)+1mj,rσ(𝐰j,r(t),𝝃i)yi𝝃i,j𝐰j,r(0)\displaystyle\qquad+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\langle\bm{\mu},j\mathbf{w}_{j,r}^{(0)}\rangle+\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\langle y_{i}\bm{\xi}_{i},j\mathbf{w}_{j,r}^{(0)}\rangle
\displaystyle\geq\frac{1}{m}\sum_{j=y_{i},r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)2qm\log(2q/\epsilon)-\frac{1}{m}\sum_{j,r}\sum_{i^{\prime}\not=i}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)2qm\log(2q/\epsilon)\frac{|\langle\bm{\xi}_{i^{\prime}},\bm{\xi}_{i}\rangle|}{\|\bm{\xi}_{i^{\prime}}\|_{2}^{2}}
1mj,rσ(𝐰j,r(t),yi𝝁)O~(σ0𝝁2)1mj,rσ(𝐰j,r(t),𝝃i)O~(σ0σpd)\displaystyle\qquad-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d})
1mj=yi,rσ(𝐰j,r(t),𝝃i)2qmlog(2q/ϵ)1mj,rσ(𝐰j,r(t),𝝃i)O~(mnd1/2)\displaystyle\geq\frac{1}{m}\sum_{j=y_{i},r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)2qm\log(2q/\epsilon)-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\widetilde{O}(mnd^{-1/2})
1mj,rσ(𝐰j,r(t),yi𝝁)O~(σ0𝝁2)1mj,rσ(𝐰j,r(t),𝝃i)O~(σ0σpd),\displaystyle\qquad-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle)\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})-\frac{1}{m}\sum_{j,r}\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle)\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}), (E.5)

where the first inequality is by Lemma B.3 and the last inequality is by Lemma B.2. Next, we bound the inner-product terms in (E.5). By Lemma C.6, we have that

|𝐰j,r(t),yi𝝁||𝐰j,r(0),yi𝝁|+γj,r(t)O~(1),\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}\rangle|\leq|\langle\mathbf{w}_{j,r}^{(0)},y_{i}\bm{\mu}\rangle|+\gamma_{j,r}^{(t)}\leq\widetilde{O}(1), (E.6)

where the last inequality is by Proposition C.2.

For j=yij=y_{i}, we can bound the inner product between the parameter and the noise as follows

maxj,r𝐰j,r(t),𝝃i\displaystyle\max_{j,r}\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle maxj,r[𝐰j,r(0),𝝃i+ρ¯j,r,i(t)]8nlog(4n2/δ)dα1,\displaystyle\geq\max_{j,r}\big{[}\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+\overline{\rho}_{j,r,i}^{(t)}\big{]}-8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\geq 1, (E.7)

where the first inequality is by Lemma C.4, the second inequality is by Lemma E.2.

For j=yij=-y_{i}, we can bound the inner product between the parameter and the noise as follows

𝐰j,r(t),𝝃i\displaystyle\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle 𝐰j,r(0),𝝃i+8nlog(4n2/δ)dα1,\displaystyle\leq\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\rangle+8n\sqrt{\frac{\log(4n^{2}/\delta)}{d}}\alpha\leq 1, (E.8)

where the first inequality is by Lemma C.5 and the last inequality is by Lemma B.3 and the conditions of σ0\sigma_{0} and dd in Condition 4.2. Therefore, plugging (E.6), (E.7), (E.8) into (E.5) gives

\displaystyle y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle\geq 2q\log(2q/\epsilon)-\widetilde{O}(mnd^{-1/2})-\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2})-\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d})
qlog(2q/ϵ),\displaystyle\geq q\log(2q/\epsilon),

where the last inequality is by dΩ~(m2n4)d\geq\widetilde{\Omega}(m^{2}n^{4}) and σ0O~(m2/(q2)n1)min{(σpd)1,𝝁21}\sigma_{0}\leq\widetilde{O}(m^{-2/(q-2)}n^{-1})\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\} in Condition 4.2. ∎
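
The last step is a simple comparison of magnitudes: the main term 2q\log(2q/\epsilon) dominates the three \widetilde{O}(\cdot) error terms, which are polynomially small under Condition 4.2. A toy numeric instance (with made-up stand-ins for the error terms) is sketched below.

```python
import math

# Toy arithmetic (made-up magnitudes, not from the paper) for the last step above: the
# main term 2q*log(2q/eps) dominates the three error terms, which are polynomially small
# under Condition 4.2, leaving at least q*log(2q/eps).
q, eps = 3, 1e-3
main = 2 * q * math.log(2 * q / eps)
errors = 1e-2 + 1e-3 + 1e-3   # stand-ins for O~(mn*d^-1/2), O~(sigma_0*||mu||_2), O~(sigma_0*sigma_p*sqrt(d))
print(f"{main:.2f} - {errors:.3f} >= {q * math.log(2 * q / eps):.2f}")
assert main - errors >= q * math.log(2 * q / eps)
```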

Lemma E.5.

Under the same conditions as Theorem 4.4, we have that

𝐖(t)𝐖F2𝐖(t+1)𝐖F2(2q1)ηLS(𝐖(t))ηϵ\displaystyle\|\mathbf{W}^{(t)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(t+1)}-\mathbf{W}^{*}\|_{F}^{2}\geq(2q-1)\eta L_{S}(\mathbf{W}^{(t)})-\eta\epsilon

for all T1tTT_{1}\leq t\leq T^{*}.

Proof of Lemma E.5.

The proof is essentially the same as the proof of Lemma D.4.

𝐖(t)𝐖F2𝐖(t+1)𝐖F2\displaystyle\|\mathbf{W}^{(t)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(t+1)}-\mathbf{W}^{*}\|_{F}^{2}
=2ηLS(𝐖(t)),𝐖(t)𝐖η2LS(𝐖(t))F2\displaystyle\qquad=2\eta\langle\nabla L_{S}(\mathbf{W}^{(t)}),\mathbf{W}^{(t)}-\mathbf{W}^{*}\rangle-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
\displaystyle\qquad=\frac{2\eta}{n}\sum_{i=1}^{n}\ell^{\prime(t)}_{i}[qy_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})-y_{i}\langle\nabla f(\mathbf{W}^{(t)},\mathbf{x}_{i}),\mathbf{W}^{*}\rangle]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
\displaystyle\qquad\geq\frac{2\eta}{n}\sum_{i=1}^{n}\ell^{\prime(t)}_{i}[qy_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})-q\log(2q/\epsilon)]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
2qηni=1n[(yif(𝐖(t),𝐱i))ϵ/(2q)]η2LS(𝐖(t))F2\displaystyle\qquad\geq\frac{2q\eta}{n}\sum_{i=1}^{n}[\ell\big{(}y_{i}f(\mathbf{W}^{(t)},\mathbf{x}_{i})\big{)}-\epsilon/(2q)]-\eta^{2}\|\nabla L_{S}(\mathbf{W}^{(t)})\|_{F}^{2}
(2q1)ηLS(𝐖(t))ηϵ,\displaystyle\qquad\geq(2q-1)\eta L_{S}(\mathbf{W}^{(t)})-\eta\epsilon,

where the first inequality is by Lemma E.4, the second inequality is due to the convexity of the cross-entropy loss, and the last inequality is due to Lemma C.7. ∎
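
The first equality in the display expands one gradient descent step \mathbf{W}^{(t+1)}=\mathbf{W}^{(t)}-\eta\nabla L_{S}(\mathbf{W}^{(t)}); the sketch below (illustration only) checks the underlying algebraic identity on random matrices.

```python
import numpy as np

# Sanity check (illustration only) of the algebraic identity behind the first equality
# above: for a gradient descent step W' = W - eta*G and any reference point W*,
#   ||W - W*||_F^2 - ||W' - W*||_F^2 = 2*eta*<G, W - W*> - eta^2*||G||_F^2.
rng = np.random.default_rng(0)
W, W_star, G = (rng.standard_normal((2, 5)) for _ in range(3))
eta = 0.1
W_next = W - eta * G
lhs = np.linalg.norm(W - W_star) ** 2 - np.linalg.norm(W_next - W_star) ** 2
rhs = 2 * eta * np.sum(G * (W - W_star)) - eta ** 2 * np.linalg.norm(G) ** 2
assert np.isclose(lhs, rhs)
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```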

Lemma E.6.

Under the same conditions as Theorem 4.4, let T=T1+𝐖(T1)𝐖F22ηϵ=T1+O~(η1ϵ1m3nd1σp2)T=T_{1}+\Big{\lfloor}\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{2\eta\epsilon}\Big{\rfloor}=T_{1}+\widetilde{O}(\eta^{-1}\epsilon^{-1}m^{3}nd^{-1}\sigma_{p}^{-2}). Then we have maxj,rγj,r(t)2β^\max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime}, maxj,r,i|ρ¯j,r,i(t)|=O~(σ0σpd)\max_{j,r,i}|\underline{\rho}_{j,r,i}^{(t)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}) for all T1tTT_{1}\leq t\leq T. Besides,

1tT1+1s=T1tLS(𝐖(s))𝐖(T1)𝐖F2(2q1)η(tT1+1)+ϵ(2q1)\displaystyle\frac{1}{t-T_{1}+1}\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)})\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta(t-T_{1}+1)}+\frac{\epsilon}{(2q-1)}

for all T1tTT_{1}\leq t\leq T, and we can find an iterate with training loss smaller than ϵ\epsilon within TT iterations.

Proof of Lemma E.6.

By Lemma E.5, for any T1tTT_{1}\leq t\leq T, we obtain that

𝐖(s)𝐖F2𝐖(s+1)𝐖F2(2q1)ηLS(𝐖(s))ηϵ\displaystyle\|\mathbf{W}^{(s)}-\mathbf{W}^{*}\|_{F}^{2}-\|\mathbf{W}^{(s+1)}-\mathbf{W}^{*}\|_{F}^{2}\geq(2q-1)\eta L_{S}(\mathbf{W}^{(s)})-\eta\epsilon (E.9)

holds for T1stT_{1}\leq s\leq t. Taking a summation, we have that

s=T1tLS(𝐖(s))\displaystyle\sum_{s=T_{1}}^{t}L_{S}(\mathbf{W}^{(s)}) 𝐖(T1)𝐖F2+ηϵ(tT1+1)(2q1)η\displaystyle\leq\frac{\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}+\eta\epsilon(t-T_{1}+1)}{(2q-1)\eta}
(i)2𝐖(T1)𝐖F2(2q1)η\displaystyle\overset{(i)}{\leq}\frac{2\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}}{(2q-1)\eta}
=(ii)O~(η1m3nd1σp2),\displaystyle\overset{(ii)}{=}\widetilde{O}(\eta^{-1}m^{3}nd^{-1}\sigma_{p}^{-2}), (E.10)

where (i) is by t\leq T and the definition of T, and (ii) is by Lemma E.3. Then we can use induction to prove that \max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime} for all t\in[T_{1},T]. Clearly, by the definition of \widehat{\beta}^{\prime}, we have \max_{j,r}\gamma_{j,r}^{(T_{1})}\leq\widehat{\beta}^{\prime}\leq 2\widehat{\beta}^{\prime}. Now suppose that there exists \widetilde{T}\in[T_{1},T] such that \max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime} for all t\in[T_{1},\widetilde{T}-1]. Then by (C.3), we have

γj,r(T~)\displaystyle\gamma_{j,r}^{(\widetilde{T})} =γj,r(T1)ηnms=T1T~1i=1ni(t)σ(𝐰j,r(t),yi𝝁)𝝁22,\displaystyle=\gamma_{j,r}^{(T_{1})}-\frac{\eta}{nm}\sum_{s=T_{1}}^{\widetilde{T}-1}\sum_{i=1}^{n}\ell_{i}^{\prime(t)}\cdot\sigma^{\prime}(\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}\rangle)\|\bm{\mu}\|_{2}^{2},
(i)γj,r(T1)+q3q1ηnm𝝁22β^q1s=T1T~1i=1n|i(t)|\displaystyle\overset{(i)}{\leq}\gamma_{j,r}^{(T_{1})}+\frac{q3^{q-1}\eta}{nm}\|\bm{\mu}\|_{2}^{2}\widehat{\beta}^{\prime q-1}\sum_{s=T_{1}}^{\widetilde{T}-1}\sum_{i=1}^{n}|\ell_{i}^{\prime(t)}|
(ii)γj,r(T1)+q3q1ηm1𝝁22β^q1s=T1T~1LS(𝐖(s))\displaystyle\overset{(ii)}{\leq}\gamma_{j,r}^{(T_{1})}+q3^{q-1}\eta m^{-1}\|\bm{\mu}\|_{2}^{2}\widehat{\beta}^{\prime q-1}\sum_{s=T_{1}}^{\widetilde{T}-1}L_{S}(\mathbf{W}^{(s)})
(iii)γj,r(T1)+β^q1O~(m2nSNR2)\displaystyle\overset{(iii)}{\leq}\gamma_{j,r}^{(T_{1})}+\widehat{\beta}^{\prime q-1}\widetilde{O}(m^{2}n\mathrm{SNR}^{2})
(iv)γj,r(T1)+β^q1O~(m2n12/q)\displaystyle\overset{(iv)}{\leq}\gamma_{j,r}^{(T_{1})}+\widehat{\beta}^{\prime q-1}\widetilde{O}(m^{2}n^{1-2/q})
(v)2β^\displaystyle\overset{(v)}{\leq}2\widehat{\beta}^{\prime}

for all j\in\{\pm 1\} and r\in[m], where (i) is by the induction hypothesis \max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime}, (ii) is by |\ell^{\prime}|\leq\ell, (iii) is by (E.10), (iv) is by n^{-1}\mathrm{SNR}^{-q}\geq\widetilde{\Omega}(1), and (v) is by \widehat{\beta}^{\prime}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}) and \widehat{\beta}^{\prime q-2}\widetilde{O}(m^{2}n^{1-2/q})=\widetilde{O}(m^{2}n^{1-2/q}(\sigma_{0}\|\bm{\mu}\|_{2})^{q-2})\leq 1 by Condition 4.2. Therefore, we have \max_{j,r}\gamma_{j,r}^{(\widetilde{T})}\leq 2\widehat{\beta}^{\prime}, which completes the induction. Moreover, the bound \max_{j,r,i}|\underline{\rho}_{j,r,i}^{(t)}|=\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}) follows from Proposition C.2 exactly as in the proof of Lemma E.2. Finally, taking t=T in the average bound displayed in the lemma statement and using T-T_{1}+1\geq\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}/(2\eta\epsilon) shows that the average training loss over [T_{1},T] is at most 3\epsilon/(2q-1)\leq\epsilon, so there exists an iterate with training loss smaller than \epsilon. ∎
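
The following sketch (illustration only, with made-up values of \eta, \epsilon and \|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}) illustrates the arithmetic behind the choice of T: the average-loss bound over the window [T_{1},T] falls below \epsilon.

```python
# Arithmetic illustration (made-up values of eta, eps and ||W^(T1) - W*||_F^2) of the
# choice of T above: with t - T1 + 1 >= ||W^(T1) - W*||_F^2 / (2*eta*eps), the average
# training loss over [T1, T] is at most 3*eps/(2q-1) <= eps, so some iterate in the
# window has training loss below eps.
q, eta, eps, D = 3, 0.01, 1e-3, 5.0         # D stands in for ||W^(T1) - W*||_F^2
window = int(D / (2 * eta * eps)) + 1       # t - T1 + 1 at t = T
avg_bound = D / ((2 * q - 1) * eta * window) + eps / (2 * q - 1)
print(f"average-loss bound {avg_bound:.2e} < eps = {eps:.0e}")
assert avg_bound < eps
```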

E.3 Population Loss

Lemma E.7 (4th statement of Theorem 4.4).

Under the same conditions as Theorem 4.4, within O~(η1nσ02qσpqdq/2+η1ϵ1m3nσp2d1)\widetilde{O}(\eta^{-1}n\sigma_{0}^{2-q}\sigma_{p}^{-q}d^{-q/2}+\eta^{-1}\epsilon^{-1}m^{3}n\sigma_{p}^{-2}d^{-1}) iterations, we can find 𝐖(T)\mathbf{W}^{(T)} such that LS(𝐖(T))ϵL_{S}(\mathbf{W}^{(T)})\leq\epsilon. Besides, for any 0tT0\leq t\leq T we have that L𝒟(𝐖(t))0.1L_{\mathcal{D}}(\mathbf{W}^{(t)})\geq 0.1.

Proof of Lemma E.7.

Given a new example (x,y)(x,y), we have that

𝐰j,r(t)2\displaystyle\|\mathbf{w}_{j,r}^{(t)}\|_{2} =𝐰j,r(0)+jγj,r(t)𝝁𝝁22+i=1nρ¯j,r,i(t)𝝃i𝝃i22+i=1nρ¯j,r,i(t)𝝃i𝝃i222\displaystyle=\bigg{\|}\mathbf{w}_{j,r}^{(0)}+j\cdot\gamma_{j,r}^{(t)}\cdot\frac{\bm{\mu}}{\|\bm{\mu}\|_{2}^{2}}+\sum_{i=1}^{n}\overline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}\bigg{\|}_{2}
(i)𝐰j,r(0)2+γj,r(t)𝝁2+i=1nρ¯j,r,i(t)𝝃i2+i=1n|ρ¯j,r,i(t)|𝝃i2\displaystyle\overset{(i)}{\leq}\|\mathbf{w}_{j,r}^{(0)}\|_{2}+\frac{\gamma_{j,r}^{(t)}}{\|\bm{\mu}\|_{2}}+\sum_{i=1}^{n}\frac{\overline{\rho}_{j,r,i}^{(t)}}{\|\bm{\xi}_{i}\|_{2}}+\sum_{i=1}^{n}\frac{|\underline{\rho}_{j,r,i}^{(t)}|}{\|\bm{\xi}_{i}\|_{2}}
=(ii)O(σ0d)+O~(nσp1d1/2),\displaystyle\overset{(ii)}{=}O(\sigma_{0}\sqrt{d})+\widetilde{O}(n\sigma_{p}^{-1}d^{-1/2}),

where (i) is by triangle inequality and (ii) is by maxj,rγj,r(t)=O~(σ0𝝁2)\max_{j,r}\gamma_{j,r}^{(t)}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}) in Lemma E.6 and maxi,j,r|ρj,r,i|4log(T)\max_{i,j,r}|\rho_{j,r,i}|\leq 4\log(T^{*}) in Proposition 5.3.

Therefore, we have that 𝐰j,r(t),𝝃𝒩(0,σp2𝐰j,r(t)22)\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle\sim\mathcal{N}(0,\sigma_{p}^{2}\|\mathbf{w}_{j,r}^{(t)}\|_{2}^{2}). So with probability 11/(4m)1-1/(4m),

|𝐰j,r(t),𝝃|O~(σ0σpd+nd1/2).\displaystyle|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle|\leq\widetilde{O}(\sigma_{0}\sigma_{p}\sqrt{d}+nd^{-1/2}).

Since the signal vector \bm{\mu} is orthogonal to the noises, by \max_{j,r}\gamma_{j,r}^{(t)}\leq 2\widehat{\beta}^{\prime}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}) in Lemma E.6, we also have that |\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle|\leq|\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}\rangle|+\gamma_{j,r}^{(t)}=\widetilde{O}(\sigma_{0}\|\bm{\mu}\|_{2}). Now by a union bound, with probability at least 1/2, we have that

Fj(𝐖j(t),𝐱)\displaystyle F_{j}(\mathbf{W}_{j}^{(t)},\mathbf{x}) =1mr=1mσ(𝐰j,r(t),y𝝁)+1mr=1mσ(𝐰j,r(t),𝝃)\displaystyle=\frac{1}{m}\sum_{r=1}^{m}\sigma(\langle\mathbf{w}_{j,r}^{(t)},y\bm{\mu}\rangle)+\frac{1}{m}\sum_{r=1}^{m}\sigma(\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle)
maxr|𝐰j,r(t),𝝁|q+maxr|𝐰j,r(t),𝝃|q\displaystyle\leq\max_{r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}\rangle|^{q}+\max_{r}|\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\rangle|^{q}
\displaystyle\leq\widetilde{O}(\sigma_{0}^{q}\sigma_{p}^{q}d^{q/2}+n^{q}d^{-q/2}+\sigma_{0}^{q}\|\bm{\mu}\|_{2}^{q})
1,\displaystyle\leq 1,

where the last inequality is by \sigma_{0}\leq\widetilde{O}(n^{-1}m^{-2/(q-2)}\cdot\min\{(\sigma_{p}\sqrt{d})^{-1},\|\bm{\mu}\|_{2}^{-1}\}) and d\geq\widetilde{\Omega}(m^{2}n^{4}) in Condition 4.2. Therefore, with probability at least 1/2, we have that

(yf(𝐖(t),𝐱))log(1+e1).\displaystyle\ell\big{(}y\cdot f(\mathbf{W}^{(t)},\mathbf{x})\big{)}\geq\log(1+e^{-1}).

Thus L𝒟(𝐖(t))log(1+e1)0.50.1L_{\mathcal{D}}(\mathbf{W}^{(t)})\geq\log(1+e^{-1})\cdot 0.5\geq 0.1. This completes the proof. ∎
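
A one-line numeric check of the final constant (illustration only):

```python
import math

# One-line numeric check (illustration only) of the final constant: when both F_{+1} and
# F_{-1} lie in [0, 1], the loss is at least log(1 + e^{-1}) ~ 0.313, and half of that
# (for the probability-1/2 event) still exceeds 0.1.
floor = math.log(1 + math.exp(-1))
print(f"log(1 + e^-1) = {floor:.4f}, half = {0.5 * floor:.4f}")
assert 0.5 * floor >= 0.1
```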
