
Coresets for Robust Training of Neural Networks against Noisy Labels

Baharan Mirzasoleiman
Department of Computer Science
University of California Los Angeles
Los Angeles, CA
[email protected]
Kaidi Cao
Department of Computer Science
Stanford University
Stanford, CA
[email protected]
Jure Leskovec
Department of Computer Science
Stanford University
Stanford, CA
[email protected]
Abstract

Modern neural networks have the capacity to overfit noisy labels frequently found in real-world datasets. Although great progress has been made, existing techniques are limited in providing theoretical guarantees for the performance of neural networks trained with noisy labels. Here we propose a novel approach with strong theoretical guarantees for robust training of deep networks with noisy labels. The key idea behind our method is to select weighted subsets (coresets) of clean data points that provide an approximately low-rank Jacobian matrix. We then prove that gradient descent applied to the subsets does not overfit the noisy labels. Our extensive experiments corroborate our theory and demonstrate that deep networks trained on our subsets achieve significantly superior performance compared to state-of-the-art methods, e.g., a 6% increase in accuracy on CIFAR-10 with 80% noisy labels, and a 7% increase in accuracy on mini WebVision (code available at https://github.com/snap-stanford/crust).

1 Introduction

The success of deep neural networks relies heavily on the quality of training data, and in particular accurate labels of the training examples. However, maintaining label quality becomes very expensive for large datasets, and hence mislabeled data points are ubiquitous in large real-world datasets [21]. As deep neural networks have the capacity to essentially memorize any (even random) labeling of the data [49], noisy labels have a drastic effect on the generalization performance of deep neural networks. Therefore, it becomes crucial to develop methods with strong theoretical guarantees for robust training of neural networks against noisy labels. Such guarantees become of the utmost importance in safety-critical systems, such as aircraft, autonomous cars, and medical devices.

There has been great empirical progress in robust training of neural networks against noisy labels. Existing directions mainly focus on: estimating the noise transition matrix [13, 34], designing robust loss functions [12, 42, 44, 52], correcting noisy labels [26, 36, 41], using explicit regularization techniques [7, 50, 51], and selecting or reweighting training examples [9, 14, 17, 27, 37, 44]. In general, estimating the noise transition matrix is challenging, correcting noisy labels is vulnerable to overfitting, and designing robust loss functions or using explicit regularization cannot achieve state-of-the-art performance [16, 23, 51]. Therefore, the most promising methods rely on selecting or reweighting training examples by knowledge distillation from auxiliary models [14, 17, 27], or by exploiting an extra clean labeled dataset containing no noisy labels [17, 25, 37, 43, 50]. In practice, training reliable auxiliary models can be challenging, and relying on an extra dataset is restrictive as it requires the training set and the extra dataset to follow the same distribution. More importantly, the major limitation of the state-of-the-art methods is their inability to provide theoretical guarantees for the performance of neural networks trained with noisy labels.

There have been a few recent efforts to theoretically explain the effectiveness of regularization and early stopping for the generalization of over-parameterized neural networks trained on noisy labels [16, 23]. Specifically, Hu et al. [16] proved that when the width of the hidden layers is sufficiently large (polynomial in the size of the training data), gradient descent with regularization by distance to initialization corresponds to kernel ridge regression using the Neural Tangent Kernel (NTK). Kernel ridge regression performs comparably to early-stopped gradient descent [35, 45] and leads to a generalization guarantee in the presence of noisy labels. In another work, Li et al. [23] proved that under a rich (clusterable) dataset model, a one-hidden-layer neural network trained with gradient descent first fits the correct labels and then starts to overfit the noisy labels. This is consistent with previous empirical findings showing that deep networks tend to learn simple examples first and then gradually memorize harder instances [5]. In practice, however, regularization and early stopping provide robustness only under relatively low levels of noise (up to 20% noisy labels) [16, 23].

Here we develop a principled technique, Crust, with strong theoretical guarantees for robust training of neural networks against noisy labels. The key idea of our method is to carefully select subsets of clean data points that allow the neural network to effectively learn from the training data but prevent it from overfitting noisy labels. To find such subsets, we rely on recent results that characterize the training dynamics based on properties of the neural network Jacobian matrix, which contains all its first-order partial derivatives. In particular, (1) learning along prominent singular vectors of the Jacobian is fast and generalizes well, while learning along small singular vectors is slow and leads to overfitting; and (2) label noise falls on the space of small singular values and impedes generalization [32]. To effectively and robustly learn from the training data, Crust efficiently finds subsets of clean and diverse data points for which the neural network has an approximately low-rank Jacobian matrix.

We show that the set of medoids of data points in the gradient space that minimizes the average gradient dissimilarity to all the other data points satisfies the above properties. To avoid overfitting noisy labels, Crust iteratively extracts and trains on the set of updated medoids. We prove that for large enough coresets and a constant fraction of noisy labels, deep networks trained with gradient descent on the medoids found by Crust do not overfit the noisy labels. We then explain how mixing up [51] the centers with a few other data points reduces the error of gradient descent updates on the coresets. Effectively, clean coresets found by Crust improve the generalization performance by reducing the ratio of noisy labels and their alignment with the space of small singular values.

We conduct experiments on noisy versions of CIFAR-10 and CIFAR-100 [22], with noisy labels generated by randomly flipping the original ones, and on the mini WebVision dataset [24], a benchmark consisting of images crawled from websites that contains real-world noisy labels. Empirical results demonstrate that the robustness of deep models trained by Crust is superior to state-of-the-art baselines, e.g., a 6% increase in accuracy on CIFAR-10 with 80% noisy labels, and a 7% increase in accuracy on mini WebVision. We note that Crust achieves state-of-the-art performance without the need to train any auxiliary model or utilize an extra clean dataset.

2 Additional Related Work

In practice, deeper and wider neural networks generalize better [40, 48]. Theoretically, recent results proved that when the number of hidden nodes is polynomial in the size of the dataset, neural network parameters stay close to their initialization, where the training landscape is almost linear [1, 4, 8, 11], or convex and semi-smooth [2]. For such networks, (stochastic) gradient descent with random initialization can almost always drive the training loss to 0, and overfit any (random or noisy) labeling of the data. Importantly, these results utilize the property that the Jacobian of the neural network is well-conditioned at a random initialization if the dataset is sufficiently diverse.

More closely related to our work is the recent result of [32] which proved that along the directions associated with large singular values of a neural network Jacobian, learning is fast and generalizes well. In contrast, early stopping can help with generalization along directions associated with small singular values. This is consistent with prior results proving the effectiveness of regularization and early stopping for providing robustness against noisy labels [16, 23]. These results, however, are restricted to unrealistically wide networks, and in practice are only effective under low levels of noise.

On the other hand, our method, Crust, provides rigorous guarantees for robust training of arbitrary deep neural networks against noisy labels, by efficiently extracting subsets of clean data points that provide an approximately low-rank Jacobian matrix during the training. Effectively, the extracted subsets do not allow the network to overfit noise, and hence Crust can quickly train a model that generalizes well. Unlike existing analytical results that are limited to a neighborhood around a random initialization, our method captures the change in Jacobian structure of deep networks for arbitrary parameter values during training. As a result, it achieves state-of-the-art performance both under mild as well as severe noise.

3 Problem Setting: Learning from Noisy Labeled Data

In this section we formally describe the problem of learning from datasets with noisy labels. Suppose we have a dataset $\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}\subset\mathbb{R}^{d}\times\mathbb{R}$, where $(\bm{x}_{i},y_{i})$ denotes the $i$-th sample with input $\bm{x}_{i}\in\mathbb{R}^{d}$ and its observed label $y_{i}\in\mathbb{R}$. We assume that the labels $\{y_{i}\}_{i=1}^{n}$ belong to one of $C$ classes. Specifically, $y_{i}\in\{\nu_{1},\nu_{2},\cdots,\nu_{C}\}$ with $\{\nu_{j}\}_{j=1}^{C}\in[-1,1]$. We further assume that the labels are separated with margin $\delta\leq|\nu_{r}-\nu_{s}|$ for all $r,s\in[C],r\neq s$. Suppose we only observe inputs and their noisy labels $\{y_{i}\}_{i=1}^{n}$, but do not observe the true labels $\{\tilde{y}_{i}\}_{i=1}^{n}$. For each class $1\leq j\leq C$, a fraction of the labels associated with that class are assigned to another label chosen from $\{\nu_{j}\}_{j=1}^{C}$.

Let $f(\bm{W},\bm{x})$ be an $L$-layer fully connected neural network with scalar output, where $\bm{x}\in\mathbb{R}^{d}$ is the input and $\bm{W}=(\bm{W}^{(1)},\cdots,\bm{W}^{(L)})$ denotes all the network parameters. Here, $\bm{W}^{(l)}\in\mathbb{R}^{d_{l}\times d_{l-1}}$ is the weight matrix in the $l$-th layer ($d_{0}=d$, $d_{L}=1$). For simplicity, we assume all the parameters are aggregated in a vector, i.e., $\bm{W}\in\mathbb{R}^{m}$, where $m=\sum_{l=2}^{L}d_{l}\times d_{l-1}$. Suppose that the network is trained by minimizing the squared loss over the noisy training dataset $\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$:

$$\mathcal{L}(\bm{W})=\frac{1}{2}\sum_{i\in V}(y_{i}-f(\bm{W},\bm{x}_{i}))^{2}, \tag{1}$$

where $V=\{1,\cdots,n\}$ is the set of all training examples. We apply gradient descent with a constant learning rate $\eta$, starting from an initial point $\bm{W}^{0}$, to minimize $\mathcal{L}(\bm{W})$. The iterations take the form

$$\bm{W}^{\tau+1}=\bm{W}^{\tau}-\eta\nabla\mathcal{L}(\bm{W}^{\tau},{\bm{X}}),\quad\nabla\mathcal{L}(\bm{W},{\bm{X}})={\cal{J}}^{T}(\bm{W},{\bm{X}})(f(\bm{W},{\bm{X}})-\bm{y}), \tag{2}$$

where ${\cal{J}}(\bm{W},{\bm{X}})\in\mathbb{R}^{n\times m}$ is the Jacobian matrix associated with the nonlinear mapping $f$, defined as

$$\mathcal{J}(\bm{W},{\bm{X}})=\big[\tfrac{\partial f(\bm{W},\bm{x}_{1})}{\partial\bm{W}}~~\cdots~~\tfrac{\partial f(\bm{W},\bm{x}_{n})}{\partial\bm{W}}\big]^{T}. \tag{3}$$

The goal is to learn a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ (in the form of a neural network) that can predict the true labels $\{\tilde{y}_{i}\}_{i=1}^{n}$ on the dataset $\mathcal{D}$. In the rest of the paper, we use ${\bm{X}}=(\bm{x}_{1},\cdots,\bm{x}_{n})$, $\bm{y}=(y_{1},\cdots,y_{n})^{T}$, $\tilde{\bm{y}}=(\tilde{y}_{1},\cdots,\tilde{y}_{n})^{T}$. Furthermore, we use ${\bm{X}}_{S}$, $\bm{y}_{S}$, and ${\cal{J}}(\bm{W},{\bm{X}}_{S})$ to denote the inputs, labels, and Jacobian matrix associated with the elements of a subset $S\subseteq V$ of data points, respectively.
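To make the notation concrete, the following minimal PyTorch sketch (not the authors' implementation; the toy two-layer network and dimensions are our assumptions) forms the Jacobian of Eq. (3) with one backward pass per example and then evaluates the gradient of Eq. (2).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # f(W, x), scalar output
X = torch.randn(32, 8)                                              # n = 32 inputs, d = 8
params = [p for p in net.parameters() if p.requires_grad]

rows = []
for x in X:                                  # one row of J per example, as in Eq. (3)
    out = net(x.unsqueeze(0)).squeeze()      # f(W, x_i)
    grads = torch.autograd.grad(out, params)
    rows.append(torch.cat([g.reshape(-1) for g in grads]))
J = torch.stack(rows)                        # J(W, X), shape (n, m)

y = torch.randn(32)                          # (possibly noisy) labels
grad_L = J.T @ (net(X).squeeze(-1) - y)      # the gradient in Eq. (2): J^T (f(W, X) - y)
```

In practice, materializing ${\cal{J}}$ for a deep network on a large dataset is infeasible, which is why Crust only works with the last-layer upper bounds of Eq. (9) introduced later.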

4 Our Approach: Crust

In this section we present our main results. We first introduce our method Crust, which selects subsets of clean data points that have an approximately low-rank Jacobian and do not allow gradient descent to overfit noisy labels. Then, we show how mixing up the subsets with a few other data points can further reduce the error of gradient descent updates and improve the generalization performance.

4.1 Extracting Clean Subsets with Approximately Low-rank Jacobian

The key idea of our method is to carefully select subsets of data points that allow the neural network to effectively learn from the clean training data but prevent it from overfitting noisy labels. Recent results on the optimization and generalization of neural networks show that the Jacobian of typical neural networks exhibits an approximately low-rank structure, i.e., a number of singular values are large and the remaining majority of the spectrum consists of small singular values. Consequently, the Jacobian spectrum can be split into an information space $\mathcal{I}$ and a nuisance space $\mathcal{N}$, associated with the large and small singular values, respectively [32]. Formally, for complementary subspaces $\mathcal{S}_{-},\mathcal{S}_{+}\subset\mathbb{R}^{n}$, and for all unit-norm vectors ${\bm{v}}\in\mathcal{S}_{+}$, ${\bm{w}}\in\mathcal{S}_{-}$, and scalars $0\leq\mu\ll\alpha\leq\beta$, we have

$$\alpha\leq\|{\cal{J}}^{T}(\bm{W},{\bm{X}}){\bm{v}}\|_{2}\leq\beta,\quad\text{and}\quad\|{\cal{J}}^{T}(\bm{W},{\bm{X}}){\bm{w}}\|_{2}\leq\mu. \tag{4}$$

There are two key observations [23, 32]: (1) while learning over the low-dimensional information space is fast and generalizes well, learning over the high-dimensional nuisance space is slow and leads to overfitting; and (2) the generalization capability and dynamics of training are dictated by how well the label vector $\bm{y}$ and the residual vector $\bm{r}=f(\bm{W},{\bm{X}})-\bm{y}$ are aligned with the information space. If the residual vector is well aligned with the singular vectors associated with the top singular values of ${\cal{J}}(\bm{W},{\bm{X}})$, the gradient update $\nabla{\cal{L}}(\bm{W})={\cal{J}}^{T}(\bm{W},{\bm{X}})\bm{r}$ significantly reduces the misfit, allowing a substantial reduction in the training error. Importantly, the residuals and labels of most clean data points fall on the information space, while those of noisy data points fall on the nuisance space and impede training and generalization.
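The following self-contained toy sketch (our construction, not the paper's code) mimics such a bimodal Jacobian spectrum with a synthetic matrix that has a few large singular values plus a small-singular-value tail, and shows how a residual vector decomposes over the two subspaces.

```python
import torch

torch.manual_seed(0)
n, m, k = 64, 256, 5
# a few strong directions (information space) plus a small-singular-value tail
J = 10.0 * (torch.randn(n, k) @ torch.randn(k, m)) + 0.1 * torch.randn(n, m)

U, S, _ = torch.linalg.svd(J, full_matrices=False)
info = U[:, :k]                       # top-k left singular vectors: information space
r = torch.randn(n)                    # a residual / label vector
r_info = info @ (info.T @ r)          # component on the information space (learned fast)
r_nuis = r - r_info                   # component on the nuisance space (prone to overfitting)

print(S[:k + 2])                      # a few large singular values, then a sharp drop
print(r_info.norm().item(), r_nuis.norm().item())
```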

To avoid overfitting noisy labels, one can leverage the first observation above and iteratively select subsets $S$ of $k$ data points that provide the best rank-$k$ approximation to the Jacobian matrix. In doing so, gradient descent applied to the subsets cannot overfit the noisy labels. Formally:

$$S^{*}(\bm{W})={\arg\min}_{S\subseteq V}\|{\cal{J}}^{T}(\bm{W},{\bm{X}})-P_{S}{\cal{J}}^{T}(\bm{W},{\bm{X}})\|_{2}\quad\text{s.t.}\quad|S|\leq k, \tag{5}$$

where ${\cal{J}}(\bm{W},{\bm{X}}_{S})\in\mathbb{R}^{k\times m}$ is the set of $k$ rows of the Jacobian matrix associated with ${\bm{X}}_{S}$, and $P_{S}={\cal{J}}^{T}(\bm{W},{\bm{X}}_{S}){\cal{J}}(\bm{W},{\bm{X}}_{S})$ denotes the projection onto the $k$-dimensional space spanned by the rows of ${\cal{J}}(\bm{W},{\bm{X}}_{S})$. Existing techniques for finding the best subset of $k$ rows or columns of an $n\times m$ matrix have a computational complexity of $\text{poly}(n,m,k)$ [3, 10, 20], where $n$ and $m$ are the number of data points and the number of parameters in the network, respectively. Note that the subset $S^{*}(\bm{W})$ depends on the parameter vector $\bm{W}$, so a new subset should be extracted after every parameter update. Furthermore, calculating the Jacobian matrix requires backpropagation on the entire dataset, which can be very expensive for deep networks. Therefore, the computational complexity of the above methods becomes prohibitive for over-parameterized neural networks trained on large datasets. Most importantly, while this approach prevents overfitting, it does not help identify the clean data points.

To achieve a good generalization performance, our approach takes advantage of both of the above observations. In particular, our goal is to find representative subsets of $k$ diverse data points with clean labels that span the information space $\mathcal{I}$ and provide an approximately low-rank Jacobian matrix. The important observation is that, since the nuisance space is very high dimensional, data points with noisy labels spread out in the gradient space. In contrast, the information space is low-dimensional, and data points with clean labels that have similar gradients cluster closely together. The set of most centrally located clean data points in the gradient space can be found by solving the following $k$-medoids problem:

$$S^{*}(\bm{W})\in{\arg\min}_{S\subseteq V}\sum_{i\in V}\min_{j\in S}d_{ij}(\bm{W})\quad\text{s.t.}\quad|S|\leq k, \tag{6}$$

where $d_{ij}(\bm{W})=\|\nabla{\cal{L}}(\bm{W},\bm{x}_{i})-\nabla{\cal{L}}(\bm{W},\bm{x}_{j})\|_{2}$ is the pairwise dissimilarity between the gradients of data points $i$ and $j$. Note that the above formulation does not provide the best rank-$k$ approximation of the Jacobian matrix. However, as the $k$-medoids objective selects a diverse set of clean data points, the minimum singular value of the Jacobian of the selected subset projected onto the subspace $\mathcal{S}_{+}$, i.e., $\sigma_{\min}({\cal{J}}(\bm{W},{\bm{X}}_{S^{*}}),\mathcal{S}_{+})$, will be large. Next, we weight the derivative of every medoid $j\in S^{*}$ by the size of its corresponding cluster, $r_{j}=\sum_{i\in V}\mathbb{I}[j={\arg\min}_{s\in S^{*}}d_{is}]$, to create the weighted Jacobian matrix ${\cal{J}}_{r}(\bm{W},{\bm{X}}_{S^{*}})=\text{diag}([r_{1},\cdots,r_{k}])\,{\cal{J}}(\bm{W},{\bm{X}}_{S^{*}})\in\mathbb{R}^{k\times m}$. We can establish the following upper and lower bounds on the singular values $\sigma_{i\in[k]}({\cal{J}}_{r}(\bm{W},{\bm{X}}_{S^{*}}),\mathcal{S}_{+})$ of the weighted Jacobian over $\mathcal{S}_{+}$:

$$\sqrt{r_{\min}}\,\sigma_{\min}({\cal{J}}(\bm{W},{\bm{X}}_{S^{*}}),\mathcal{S}_{+})\leq\sigma_{i\in[k]}({\cal{J}}_{r}(\bm{W},{\bm{X}}_{S^{*}}),\mathcal{S}_{+})\leq\sqrt{r_{\max}}\,\|{\cal{J}}(\bm{W},{\bm{X}}_{S^{*}})\|, \tag{7}$$

where $r_{\min}=\min_{j\in[k]}r_{j}$ and $r_{\max}=\max_{j\in[k]}r_{j}$, and we incur an error of $\epsilon$ in approximating the largest singular value of the neural network Jacobian, $\epsilon\leq|\sqrt{r_{\max}}\|{\cal{J}}(\bm{W},{\bm{X}}_{S^{*}})\|-\|{\cal{J}}(\bm{W},{\bm{X}})\||$. Now, we apply the gradient descent updates in Eq. (2) to the weighted Jacobian ${\cal{J}}_{r}(\bm{W},{\bm{X}}_{S^{*}})$ of the $k$ extracted medoids:

$$\bm{W}^{\tau+1}=\bm{W}^{\tau}-\eta\,{\cal{J}}_{r}^{T}(\bm{W},{\bm{X}}_{S^{*}})(f(\bm{W},{\bm{X}}_{S^{*}})-{\bm{y}}_{S^{*}}). \tag{8}$$
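As a sanity check on Eq. (8), the weighted update is one gradient step on a per-medoid weighted squared loss, since the gradient of $\frac{1}{2}\sum_{j}r_{j}(f(\bm{W},\bm{x}_{j})-y_{j})^{2}$ equals ${\cal{J}}_{r}^{T}(f-\bm{y})$. A minimal sketch follows; the toy network, tensors, and learning rate are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
X_S = torch.randn(10, 8)                       # inputs of the k = 10 selected medoids
y_S = torch.randn(10)                          # their (mostly clean) labels
r = torch.tensor([7., 3., 5., 9., 2., 6., 4., 8., 1., 5.])   # cluster sizes r_j
eta = 0.05

residual = net(X_S).squeeze(-1) - y_S
loss = 0.5 * (r * residual.pow(2)).sum()       # its gradient is J^T diag(r) (f - y) = J_r^T (f - y)
loss.backward()
with torch.no_grad():
    for p in net.parameters():                 # one step of Eq. (8)
        p -= eta * p.grad
        p.grad = None
```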

Note that we still need backpropagation on the entire dataset to be able to compute the pairwise dissimilarities $d_{ij}$. For neural networks, it has been shown that the variation of the gradient norms is mostly captured by the gradient of the loss with respect to the input to the last layer of the network [19]. This argument can be used to efficiently upper-bound the pairwise gradient dissimilarities [30]:

$$d_{ij}(\bm{W})=\|\nabla{\cal{L}}(\bm{W},\bm{x}_{i})-\nabla{\cal{L}}(\bm{W},\bm{x}_{j})\|_{2}\leq c_{1}\|\Sigma^{\prime}_{L}({\bm{z}}_{i}^{L})\nabla_{i}^{L}{\cal{L}}-\Sigma^{\prime}_{L}({\bm{z}}_{j}^{L})\nabla_{j}^{L}{\cal{L}}\|_{2}+c_{2}, \tag{9}$$

where $\Sigma^{\prime}_{L}({\bm{z}}_{i}^{L})\nabla_{i}^{L}{\cal{L}}$ is the gradient of the loss function ${\cal{L}}$ with respect to the input to the last layer $L$ for data point $i$, and $c_{1},c_{2}$ are constants. The above upper bound is only marginally more expensive to calculate than the value of the loss, since it can be computed in closed form in terms of ${\bm{z}}^{L}$. Hence, $d^{u}_{ij}=\|\Sigma^{\prime}_{L}({\bm{z}}_{i}^{L})\nabla_{i}^{L}{\cal{L}}-\Sigma^{\prime}_{L}({\bm{z}}_{j}^{L})\nabla_{j}^{L}{\cal{L}}\|_{2}$ can be calculated efficiently. We note that although the upper bounds $d^{u}_{ij}$ have a lower dimensionality than $d_{ij}$, noisy data points still spread out in this lower-dimensional space and hence are not selected as medoids. This is confirmed by our experiments (Fig. 1 (a)).
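For concreteness, here is a hedged sketch of computing the upper bounds $d^{u}_{ij}$ from last-layer information only. We take the common softmax/cross-entropy case, where the gradient of the loss with respect to the last-layer pre-activations is simply the softmax output minus the one-hot label; this concrete choice and the toy tensors are our assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(100, 10)            # last-layer pre-activations z^L for 100 examples
labels = torch.randint(0, 10, (100,))    # observed (possibly noisy) labels

# per-example gradient of the cross-entropy loss w.r.t. the logits
g = F.softmax(logits, dim=1) - F.one_hot(labels, num_classes=10).float()
d_u = torch.cdist(g, g)                  # d^u_ij: pairwise Euclidean dissimilarities, (100, 100)
```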

Having upper bounds on the pairwise gradient dissimilarities, we can efficiently find a near-optimal solution to Problem (6) by turning it into a submodular maximization problem. A set function $F:2^{V}\rightarrow\mathbb{R}^{+}$ is submodular if $F(S\cup\{e\})-F(S)\geq F(T\cup\{e\})-F(T)$ for any $S\subseteq T\subseteq V$ and $e\in V\setminus T$. $F$ is monotone if $F(e|S)\geq 0$ for any $e\in V\setminus S$ and $S\subseteq V$. Minimizing the objective in Problem (6) is equivalent to maximizing the following submodular facility location function:

$$S^{*}(\bm{W})\in{\arg\max}_{S\subseteq V,\,|S|\leq k}F(S,\bm{W}),\quad\quad F(S,\bm{W})=\sum_{i\in V}\max_{j\in S}\big(d_{0}-d^{u}_{ij}(\bm{W})\big), \tag{10}$$

where $d_{0}$ is a constant satisfying $d_{0}\geq d^{u}_{ij}(\bm{W})$ for all $i,j\in V$. For maximizing the above monotone submodular function, the classical greedy algorithm provides a constant $(1-1/e)$-approximation. The greedy algorithm starts with the empty set $S_{0}=\emptyset$, and at each iteration $t$ it chooses an element $e\in V$ that maximizes the marginal utility $F(e|S_{t})=F(S_{t}\cup\{e\})-F(S_{t})$. Formally, $S_{t}=S_{t-1}\cup\{{\arg\max}_{e\in V}F(e|S_{t-1})\}$. The computational complexity of the greedy algorithm is $\mathcal{O}(nk)$. However, its complexity can be reduced to $\mathcal{O}(|V|)$ using stochastic methods [29], and can be further improved using lazy evaluation [28] and distributed implementations [31]. Note that this complexity does not involve any backpropagation, as we use the upper bounds calculated in Eq. (9). Hence, the subsets can be found very efficiently, in parallel for all classes. Unlike the majority of existing techniques for robust training against noisy labels, which have a large computational complexity, robust training with Crust is even faster than training on the entire dataset. We also note that Problem (10) can be addressed in the streaming setting for very large datasets [6].
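Below is a compact sketch of this greedy maximization of Eq. (10) on a precomputed dissimilarity matrix, returning the selected medoids together with their cluster sizes $r_{j}$. The dense-matrix formulation and variable names are our choices; an actual implementation would likely use the stochastic or lazy variants cited above.

```python
import torch

def greedy_facility_location(d_u, k):
    """Greedily pick k medoids maximizing sum_i max_{j in S} (d0 - d_u[i, j])."""
    n = d_u.shape[0]
    sim = d_u.max() - d_u                       # d0 - d_ij, so larger means closer
    best = torch.zeros(n)                       # current max similarity to the chosen set
    selected = []
    for _ in range(k):
        gains = torch.clamp(sim - best.unsqueeze(1), min=0).sum(dim=0)  # marginal utilities
        if selected:
            gains[selected] = -1.0              # never re-select a medoid
        j = int(gains.argmax())
        selected.append(j)
        best = torch.maximum(best, sim[:, j])
    assign = d_u[:, selected].argmin(dim=1)     # assign each point to its closest medoid
    weights = torch.bincount(assign, minlength=k).float()   # cluster sizes r_j
    return selected, weights

torch.manual_seed(0)
D = torch.rand(100, 100)
D = (D + D.T) / 2
D.fill_diagonal_(0.0)
medoids, r = greedy_facility_location(D, k=10)
```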

Our experiments confirm that Crust can successfully find almost all of the clean data points (c.f. Fig. 1 (a)). The following theorem guarantees that for a small fraction $\rho$ of noisy labels in the selected subsets, deep networks trained with gradient descent do not overfit the noisy labels.

Theorem 4.1

Assume that we apply gradient descent on the least-squares loss in Eq. (2) to train a neural network on a dataset with noisy labels. Furthermore, suppose that the Jacobian mapping is $L$-smooth (if $\frac{\partial{\cal{J}}(\bm{W},{\bm{X}})}{\partial\bm{W}}$ is continuous, the smoothness condition holds over any compact domain, albeit for a possibly large $L$). Assume that the dataset has a label margin of $\delta$, and the coresets found by Crust contain a fraction $\rho<\delta/8$ of noisy labels. If the coresets approximate the Jacobian matrix with an error of at most $\epsilon\leq{\cal{O}}(\frac{\delta\alpha^{2}}{k\beta\log({\sqrt{k}}/{\rho})})$, where $\alpha=\sqrt{r_{\min}}\sigma_{\min}({\cal{J}}(\bm{W},{\bm{X}}_{S}))$ and $\beta=\|{\cal{J}}(\bm{W},{\bm{X}})\|+\epsilon$, then for $L\leq\frac{\alpha\beta}{L\sqrt{2k}}$ and step size $\eta=\frac{1}{2\beta^{2}}$, after $\tau\geq{\cal{O}}({\frac{1}{\eta\alpha^{2}}}\log(\frac{\sqrt{n}}{\rho}))$ iterations the network classifies all the selected elements correctly.

The proof can be found in the Appendix. Note that the elements of the selected subsets are mostly clean, and hence the noise ratio $\rho$ is much smaller in the subsets than in the entire dataset. Very recently, [32] showed that the classification error of neural networks trained on noisy datasets of size $n$ is controlled by the portion of the labels that fall over the nuisance space, i.e., $\|\Pi_{\mathcal{N}}(\bm{y})\|/\sqrt{n}$. Coresets of size $k$ selected by Crust are mostly clean. For such subsets, the label vector is mostly aligned with the information space, and thus $\|\Pi_{\mathcal{N}}(\bm{y}_{S})\|/\sqrt{k}$ is smaller. Our method improves the generalization performance by extracting subsets $S$ of size $k$ for which $\|\Pi_{\mathcal{N}}(\bm{y}_{S})\|/\sqrt{k}\leq\|\Pi_{\mathcal{N}}(\bm{y})\|/\sqrt{n}$. While we defer a formal generalization proof to future work, our experiments show that even under severe noise (80% noisy labels), Crust successfully finds the clean data points and achieves a superior generalization performance (c.f. Fig. 1).

Next, we discuss how to reduce the error of backpropagation on the weighted centers.

Algorithm 1: Coresets for Robust Training against Noisy Labels (Crust)

1: Input: the noisy dataset $\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}$, number of iterations $T$.
2: Output: model parameters $\bm{W}^{T}$.
3: for $\tau=1,\cdots,T$ do
4:     $S^{\tau}=\emptyset$.
5:     for $c\in\{1,\cdots,C\}$ do
6:         $U_{c}^{\tau}=\{(\bm{x}_{i},y_{i})\in\mathcal{D}~|~f(\bm{W}^{\tau},\bm{x}_{i})=\nu_{c}\}$, $n_{c}=|U_{c}^{\tau}|/n$.   ▷ Classify based on predictions.
7:         $d^{u}_{ij}=$ upper-bounded pairwise gradient dissimilarities for $i,j\in U_{c}^{\tau}$.   ▷ Eq. (9).
8:         $S_{c}^{\tau}=\{k\cdot n_{c}$ medoids from $U_{c}^{\tau}$ using $d_{ij}^{u}\}$.   ▷ The greedy algorithm.
9:         for $j\in S_{c}^{\tau}$ do
10:             $V_{j}^{\tau}=\{i\in U_{c}^{\tau}~|~j={\arg\min}_{v\in S_{c}^{\tau}}d^{u}_{iv}\}$.
11:             $R_{j}^{\tau}=$ a small random sample from $V_{j}^{\tau}$.
12:             $\hat{D}_{j}^{\tau}=$ Mixup of $(\bm{x}_{j},y_{j})$ with $\{(\bm{x}_{i},y_{i})~|~i\in R_{j}^{\tau}\}$.   ▷ Eq. (11).
13:             $r_{i}=|V_{j}^{\tau}|/|R_{j}^{\tau}|,~\forall i\in R_{j}^{\tau}$.   ▷ Coreset weights in Eq. (8).
14:             $S^{\tau}=S^{\tau}\cup\hat{D}_{j}^{\tau}$.
15:         end for
16:     end for
17:     Update the parameters $\bm{W}^{\tau}$ using weighted gradient descent on $S^{\tau}$.   ▷ Eq. (2).
18: end for

4.2 Further Reducing the Error of Coresets

There are two potential sources of error during the weighted gradient descent updates in Eq. (8). First, we have an $\epsilon$ error in estimating the prominent singular value of the Jacobian matrix. Second, although the $k$-medoids formulation selects the centers of clustered clean data points in the gradient space, there is still a small chance, in particular early in the training process when the gradients are more uniformly distributed, that the coresets contain some noisy labels. Both errors can be alleviated if we slightly relax the constraint of training on exact feature vectors and their labels, and allow training on combinations of every center $j\in S^{*}$ with a few examples in its corresponding cluster $V_{j}$.

This is the idea behind mixup [51], which extends the training distribution with convex combinations of pairs of examples and their labels. For every cluster $V_{j}$, we select a few data points $R_{j}\subset V_{j}\setminus\{j\}$, $|R_{j}|\ll|V_{j}|$, uniformly at random, and mix up every data point $(\bm{x}_{i},y_{i}),~i\in R_{j}$, with the corresponding center $(\bm{x}_{j},y_{j}),~j\in S^{*}$, to obtain the set $\hat{D}_{j}$ of mixed-up points:

$$\hat{D}_{j}=\{(\hat{\bm{x}},\hat{y})~|~\hat{\bm{x}}=\lambda\bm{x}_{i}+(1-\lambda)\bm{x}_{j},~\hat{y}=\lambda y_{i}+(1-\lambda)y_{j}\quad\forall i\in R_{j}\}, \tag{11}$$

where $\lambda\sim\text{Beta}(\alpha,\alpha)\in[0,1]$ and $\alpha\in\mathbb{R}^{+}$. The mixup hyper-parameter $\alpha$ controls the strength of interpolation between feature-label pairs. Our experiments show that the subsets chosen by Crust contain mostly clean data points, but may contain some noisy labels early during training (Fig. 1 (a)). Mixing up the centers with as few as one example from the corresponding cluster, i.e., $|R_{j}|=1$, can reduce the effect of potential noisy labels in the selected subsets. Therefore, mixup can further help improve the generalization performance, as confirmed by our experiments (Table 2).
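A small sketch of Eq. (11) with $|R_{j}|=1$, mixing one randomly drawn cluster member with its medoid; the value of $\alpha$ and the toy tensors are assumptions.

```python
import torch

torch.manual_seed(0)
alpha = 1.0                                             # assumed mixup hyper-parameter
x_j, y_j = torch.randn(3, 32, 32), torch.tensor(1.0)    # a medoid (x_j, y_j), j in S*
x_i, y_i = torch.randn(3, 32, 32), torch.tensor(0.0)    # one sampled member of its cluster

lam = torch.distributions.Beta(alpha, alpha).sample()
x_hat = lam * x_i + (1 - lam) * x_j                     # Eq. (11)
y_hat = lam * y_i + (1 - lam) * y_j
```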

4.3 Iteratively Reducing Noise

The subsets found in Problem (10) depend on the parameter vector $\bm{W}$ and need to be updated during training. Let $\bm{W}^{\tau}$ be the parameter vector at iteration $\tau$. At update time $\tau\in[T]$, we first classify data points based on the updated predictions $\bm{y}^{\tau}=f(\bm{W}^{\tau},{\bm{X}})$. We denote by $U_{c}^{\tau}=\{(\bm{x}_{i},y_{i})\in\mathcal{D}~|~f(\bm{W}^{\tau},\bm{x}_{i})=\nu_{c}\}$ the set of data points labeled as $\nu_{c}$ at iteration $\tau$. Then, we find $S(\bm{W}^{\tau})$ by greedily extracting $k\cdot n_{c}$ medoids from each class, where $n_{c}=|U_{c}^{\tau}|/n$ is the fraction of data points in class $c\in[C]$. Finding separate coresets for each class further helps avoid clustering together noisy data points spread out in the nuisance space, and improves the accuracy of the extracted coresets. Next, we partition the data points in every class by assigning every data point to its closest medoid. Formally, for partition $V_{j}^{\tau}$ we have $V_{j}^{\tau}=\{i\in U_{c}^{\tau}~|~j={\arg\min}_{v\in S^{\tau}}d^{u}_{iv}\}$. Finally, we take a small random sample $R_{j}^{\tau}$ from every partition $V_{j}^{\tau}$, mix up every data point $i\in R_{j}^{\tau}$ with the corresponding medoid $j\in S(\bm{W}^{\tau})$ according to Eq. (11), and add the generated set $\hat{D}_{j}^{\tau}$ of mixed-up data points to the training set. In our experiments we use $|R_{j}^{\tau}|=1$. At update time $\tau$, we train on the union of the sets generated by mixup, i.e., $D^{\tau}=\{\hat{D}_{1}^{\tau}\cup\cdots\cup\hat{D}_{k}^{\tau}\}$, where every data point $i\in\hat{D}_{j}^{\tau}$ is weighted by $r_{i}=|V_{j}^{\tau}|/|R_{j}^{\tau}|$. The pseudocode of Crust is given in Algorithm 1.

Note that while Crust updates the coreset during training, almost all the clean data points are contained in the coresets at every iteration, so gradient descent can successfully contract their residuals as in Theorem 4.1 and fit the correct labels. Since Crust finds a new subset at every iteration $\tau$, we need to use $\alpha=\min_{\tau}\sqrt{r_{\min}^{\tau}}\,\sigma_{\min}({\cal{J}}(\bm{W}^{\tau},{\bm{X}}_{S^{\tau}}))$ and $\beta=\max_{\tau}\sqrt{r_{\max}^{\tau}}\,\|{\cal{J}}(\bm{W}^{\tau},{\bm{X}}_{S^{\tau}})\|$ in Theorem 4.1, where $\alpha$ and $\beta$ are the minimum and maximum singular values of the Jacobians of the subsets weighted by $\bm{r}^{\tau}$ found by Crust during the $T$ steps of gradient descent updates.

5 Experiments

We evaluate our method on artificially corrupted versions of CIFAR-10 and CIFAR-100 [22] with controllable degrees of label noise, as well as a real-world large-scale dataset mini WebVision [24], which contains real noisy labels. Our algorithm is developed with PyTorch [33]. We use 1 Nvidia GTX 1080 Ti for all CIFAR experiments and 4 for training on the mini WebVision dataset.

Baselines. We compare our approach with multiple state-of-the-art methods for robust training against label corruption. (1) F-correction [34] first naively trains a neural network using ERM and then estimates the noise transition matrix $T$; $T$ is then used to construct a corrected loss function with which the model is retrained. (2) MentorNet [17] first pretrains a teacher network to mimic a curriculum; the student network is then trained with the sample reweighting scheme provided by the teacher network. (3) Decoupling [27] trains two networks simultaneously, and in every mini-batch the two networks only train on the subset of samples on which their predictions disagree. (4) D2L [26] reduces the effect of noisy labels on learning the true data distribution after learning rate annealing, using corrected labels. (5) Co-teaching [14] also maintains two networks during training; each network selects clean data (samples with small loss) and guides the other network to train on its selected clean subset. (6) INCV [9] first estimates the noise transition matrix $T$ through cross-validation, then applies iterative Co-teaching by including samples with small losses. (7) T-Revision [46] designs a deep-learning-based risk-consistent estimator to tune the transition matrix accurately. (8) L_DMI [47] proposes an information-theoretic noise-robust loss function based on generalized mutual information.

5.1 Empirical results on artificially corrupted CIFAR

We first evaluate our method on CIFAR-10 and CIFAR-100, which contain 50,000 training images and 10,000 test images of size $32\times 32$, with 10 and 100 classes, respectively. We follow the testing protocol adopted in [9, 14] and consider both symmetric and asymmetric label noise. Specifically, we test noise ratios of 0.2, 0.5, and 0.8 for symmetric noise, and 0.4 for asymmetric noise.
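For concreteness, here is a hedged sketch of injecting symmetric label noise (each corrupted label is flipped uniformly to one of the other classes); the exact corruption protocol of [9, 14] may differ in minor details.

```python
import numpy as np

def add_symmetric_noise(labels, noise_ratio, num_classes, seed=0):
    """Flip a noise_ratio fraction of labels uniformly to one of the other classes."""
    rng = np.random.default_rng(seed)
    noisy = np.array(labels).copy()
    flip_idx = rng.choice(len(noisy), size=int(noise_ratio * len(noisy)), replace=False)
    for i in flip_idx:
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy

clean = np.random.randint(0, 10, size=50000)          # e.g. CIFAR-10 labels
noisy = add_symmetric_noise(clean, noise_ratio=0.5, num_classes=10)
print((noisy != clean).mean())                        # approximately 0.5
```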

Table 1: Average test accuracy (5 runs) on CIFAR-10 and CIFAR-100. The best test accuracy in each setting is achieved by Crust (last row), with up to a 6% improvement (3.15% on average) over the strongest baseline, INCV. We note the superior performance of Crust under 80% label noise.

Dataset          |                 CIFAR-10                  |          CIFAR-100
Noise Type       |        Sym                     |   Asym   |        Sym          |  Asym
Noise Ratio (%)  | 20        | 50        | 80       | 40       | 20        | 50        | 40
F-correction     | 85.1±0.4  | 76.0±0.2  | 34.8±4.5 | 83.6±2.2 | 55.8±0.5  | 43.3±0.7  | 42.3±0.7
Decoupling       | 86.7±0.3  | 79.3±0.6  | 36.9±4.6 | 75.3±0.8 | 57.6±0.5  | 45.7±0.4  | 43.1±0.4
Co-teaching      | 89.1±0.3  | 82.1±0.6  | 16.2±3.2 | 84.6±2.8 | 64.0±0.3  | 52.3±0.4  | 47.7±1.2
MentorNet        | 88.4±0.5  | 77.1±0.4  | 28.9±2.3 | 77.3±0.8 | 63.0±0.4  | 46.4±0.4  | 42.4±0.5
D2L              | 86.1±0.4  | 67.4±3.6  | 10.0±0.1 | 85.6±1.2 | 12.5±4.2  | 5.6±5.4   | 14.1±5.8
INCV             | 89.7±0.2  | 84.8±0.3  | 52.3±3.5 | 86.0±0.5 | 60.2±0.2  | 53.1±0.4  | 50.7±0.2
T-Revision       | 79.3±0.5  | 78.5±0.6  | 36.2±1.6 | 76.3±0.8 | 52.4±0.3  | 37.6±0.3  | 32.3±0.4
L_DMI            | 84.3±0.4  | 78.8±0.5  | 20.9±2.2 | 84.8±0.7 | 56.8±0.4  | 42.2±0.5  | 39.5±0.4
Crust            | 91.1±0.2  | 86.3±0.3  | 58.3±1.8 | 88.8±0.4 | 65.2±0.2  | 56.4±0.4  | 53.0±0.2

In our experiments, we train ResNet-32 [15] for 120 epochs with a minibatch size of 128. We optimize the objective with SGD, using an initial learning rate of 0.1 decayed by a factor of 10 at epochs 80 and 100, a momentum of 0.9, and a weight decay of $5\times 10^{-4}$. We only use simple data augmentation following [15]: we first pad 4 pixels on every side of the image and then randomly crop a $32\times 32$ image from the padded image; we flip the image horizontally with a probability of 0.5. For Crust, we select coresets of size 50% of the dataset unless otherwise stated.
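A sketch of this training configuration (SGD with momentum and weight decay, step decay at epochs 80 and 100, and the standard CIFAR augmentation described above). We use a torchvision ResNet-18 as a stand-in, since ResNet-32 is not part of torchvision; everything else follows the stated hyper-parameters.

```python
import torch
import torchvision
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),       # pad 4 pixels on each side, then crop 32x32
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

model = torchvision.models.resnet18(num_classes=10)   # stand-in; the paper trains ResNet-32
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 100], gamma=0.1)
```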

We report the top-1 test accuracy of the various methods in Table 1. Our proposed method Crust outperforms all the baselines in terms of average test accuracy. While INCV attempts to find a subset of the training set with heuristics, our theoretically principled method can successfully distinguish data points with correct labels from those with noisy labels, which results in a clear improvement across all settings. Crust achieves a consistent improvement, by an average of 3.15%, under various symmetric and asymmetric noise scenarios compared to the strongest baseline, INCV. Interestingly, Crust achieves its largest improvement of 6% over INCV under severe 80% label noise. This shows the effectiveness of our method in extracting data points with clean labels from a large number of noisy data points, compared to other baselines.

5.2 Ablation study on each component

Here we investigate the effect of each component of Crust and its importance, for robust training on CIFAR-10 with 20% and 50% symmetric noise. Table 2 summarizes the results.

Effect of the coreset. Based on the empirical results, the coreset plays an important role in improving the generalization of the trained networks. It is worth noticing that by greedily finding the coreset based on the gradients, we already outperform INCV. This clearly corroborates our theory and shows the effectiveness of Crust in filtering the noise and extracting data points with correct labels, compared to other heuristics. It also confirms our argument that although the upper-bounded gradient dissimilarities in Eq. (9) have a much lower dimensionality than the exact gradients, noisy data points still spread out in the gradient space. Therefore, Crust can successfully identify central data points with clean labels in the gradient space.

Effect of mixup and model update. As discussed in Sec. 4.2, mixup can reduce the bias of estimating the full gradient with the coreset. Moreover, finding separate coresets from each class can further help filter noisy labels and improves the accuracy of the extracted coresets. At the beginning of every epoch, Crust updates the predictions based on the current model parameters and extracts coresets from every class separately. We observed that both components, namely mixup and extracting coresets separately from each class based on the predictions of the model being trained, further improve the generalization and hence the final accuracy.

Table 2: Ablation study on CIFAR-10 with 20% and 50% symmetric noise. A ✓ indicates that the corresponding component is used. "coreset w/ label" and "coreset w/ pred." correspond to finding coresets separately from every class based on the observed noisy labels or on the labels predicted by the model being trained, respectively.
Component                                                          Noise Ratio
coreset w/ label   coreset w/ pred.   w/o mixup   w/ mixup         20       50
✓                                     ✓                            90.21    84.92
✓                                                  ✓               90.48    85.23
                   ✓                  ✓                            90.71    85.57
                   ✓                               ✓               91.12    86.27
Figure 1: Training curve for CIFAR-10 with 50% symmetric noise. (a) The fraction of correct labels in the coreset (label accuracy) with respect to epochs. (b) The accuracy on the whole training set with respect to epochs. (c) The accuracy on test set with respect to epochs. Here we consider coresets of size 30%, 50%, and 70% of the CIFAR-10 training dataset.

Size of the coresets. Fig. 1 shows the training curves for CIFAR-10 with 50% symmetric noise. Fig. 1(a) shows the label accuracy of coresets of size 30%, 50%, and 70% selected by Crust. We observe that for all coreset sizes, the number of noisy centers decreases over time. Furthermore, the fraction of correct labels in the coresets (label accuracy) decreases when the size of the selected centers increases from 30% to 70%. This demonstrates that Crust identifies clean data points first. Fig. 1(b) and (c) show the train and test accuracy when training with Crust on coresets of various sizes. We can see that coresets of size 30% achieve a lower accuracy, as they are too small to accurately estimate the spectrum of the information space and achieve a good generalization performance. Coresets of size 70% achieve lower training and test accuracy than coresets of size 50%: as 50% of the labels are noisy, subsets of size 70% contain at least 20% noisy labels, and the model eventually overfits the noise.

5.3 Empirical results on mini WebVision

WebVision is a real-world dataset with inherent label noise [24]. It contains 2.4 million images crawled from Google and Flickr that share the same 1000 classes as the ImageNet dataset. The noise ratio across classes varies from 0.5% to 88% (Fig. 4 in [24] shows the noise distribution). We follow the setting in [17] and create a mini WebVision that consists of the top 50 classes of the Google subset, with 66,000 images. We use both the WebVision and ImageNet test sets to evaluate the model trained on coresets of size 50% of the data found by Crust. We train InceptionResNet-v2 [39] for 90 epochs with a starting learning rate of 0.1, annealed at epochs 30 and 60. The results are shown in Table 3. Our method consistently outperforms the other baselines and achieves an average 5% improvement in test accuracy compared to INCV.

Table 3: Test accuracy on mini WebVision. Crust (last row) achieves the best test accuracy, with up to a 7.16% improvement (6.46% on average) in top-1 accuracy over the strongest baseline, INCV.
Method        | WebVision Top-1 | WebVision Top-5 | ImageNet Top-1 | ImageNet Top-5
F-correction  | 61.12           | 82.68           | 57.36          | 82.36
Decoupling    | 62.54           | 84.74           | 58.26          | 82.26
Co-teaching   | 63.58           | 85.20           | 61.48          | 84.70
MentorNet     | 63.00           | 81.40           | 57.80          | 79.92
D2L           | 62.68           | 84.00           | 57.80          | 81.36
INCV          | 65.24           | 85.34           | 61.60          | 84.98
Crust         | 72.40           | 89.56           | 67.36          | 87.84

6 Conclusion

We proposed a novel approach with strong theoretical guarantees for robust training of neural networks against noisy labels. Our method, Crust, relies on the following key observations: (1) learning along prominent singular vectors of the Jacobian is fast and generalizes well, while learning along small singular vectors is slow and leads to overfitting; and (2) the generalization capability and dynamics of training are dictated by how well the label and residual vectors are aligned with the information space. To achieve a good generalization performance and avoid overfitting, Crust iteratively selects subsets of clean data points that provide an approximately low-rank Jacobian matrix. We proved that for a constant fraction of noisy labels in the subsets, neural networks trained with gradient descent on the subsets found by Crust correctly classify all the selected data points. At the same time, our method improves the generalization performance of the deep network by decreasing the portion of noisy labels that fall over the nuisance space of the network Jacobian. Our extensive experiments demonstrated the effectiveness of our method in providing robustness against noisy labels. In particular, we showed that deep networks trained on our subsets achieve significantly superior performance, e.g., a 6% increase in accuracy on CIFAR-10 with 80% noisy labels and a 7% increase in accuracy on mini WebVision, compared to state-of-the-art baselines.

Broader Impact

Deep neural networks achieve impressive results in a wide variety of domains, including vision and speech recognition. The quality of the trained deep models increases logarithmically with the size of the data [38]. This improvement, however, is contingent on the availability of reliable and accurate labels. In practice, collecting large high-quality datasets is often very expensive and time-consuming. For example, labeling of medical images depends on domain experts and hence is very resource-intensive. In some applications, this necessitates obtaining consensus labels or labels from multiple experts, together with methods for aggregating those annotations to get the ground-truth labels [18]. In some domains, crowd-sourcing methods are used to obtain labels from non-experts. An alternative solution is automated mining of data, e.g., from the Internet, using image-level tags that can be regarded as labels. These solutions are cheaper and more time-efficient than human annotation, but label noise in such datasets is expected to be higher than in expert-labeled datasets. Noisy labels have a drastic effect on the generalization performance of deep neural networks. This prevents deep networks from being employed in real-world noisy scenarios, in particular in safety-critical applications such as aircraft, autonomous cars, and medical devices.

State-of-the-art methods for training deep networks with noisy labels are mostly heuristics and cannot provide theoretical guarantees for the robustness of the trained model in the presence of noisy labels. Failure of such systems can have drastic effects in sensitive and safety-critical applications. Our research provides a principled method for training deep networks on real-world datasets with noisy labels. Our proposed method, Crust, is based on recent advances in the theoretical understanding of neural networks, and provides theoretical guarantees for the performance of deep networks trained with noisy labels. We expect our method to have a far-reaching impact on the deployment of deep neural networks in real-world systems. We believe our research will be beneficial for deep learning in a variety of domains and does not have any societal or ethical disadvantages.

Acknowledgments

We gratefully acknowledge the support of DARPA under Nos. FA865018C7880 (ASED), N660011924033 (MCS); ARO under Nos. W911NF-16-1-0342 (MURI), W911NF-16-1-0171 (DURIP); NSF under Nos. OAC-1835598 (CINES), OAC-1934578 (HDR), CCF-1918940 (Expeditions), IIS-2030477 (RAPID); Stanford Data Science Initiative, Wu Tsai Neurosciences Institute, Chan Zuckerberg Biohub, Amazon, Boeing, JPMorgan Chase, Docomo, Hitachi, JD.com, KDDI, NVIDIA, Dell. J. L. is a Chan Zuckerberg Biohub investigator.

References

  • [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pages 6155–6166, 2019.
  • [2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252, 2019.
  • [3] Jason Altschuler, Aditya Bhaskara, Gang Fu, Vahab Mirrokni, Afshin Rostamizadeh, and Morteza Zadimoghaddam. Greedy column subset selection: New bounds and distributed algorithms. In International Conference on Machine Learning, pages 2539–2548, 2016.
  • [4] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332, 2019.
  • [5] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 233–242. JMLR. org, 2017.
  • [6] Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streaming submodular maximization: Massive data summarization on the fly. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 671–680, 2014.
  • [7] Kaidi Cao, Yining Chen, Junwei Lu, Nikos Arechiga, Adrien Gaidon, and Tengyu Ma. Heteroskedastic and imbalanced deep learning with adaptive regularization. arXiv preprint arXiv:2006.15766, 2020.
  • [8] Yuan Cao and Quanquan Gu. A generalization theory of gradient descent for learning over-parameterized deep relu networks. arXiv preprint arXiv:1902.01384, 2019.
  • [9] Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Shengyu Zhang. Understanding and utilizing deep neural networks trained with noisy labels. In International Conference on Machine Learning, pages 1062–1070, 2019.
  • [10] Michael B Cohen, Cameron Musco, and Christopher Musco. Input sparsity time low-rank approximation via ridge leverage score sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1758–1777. SIAM, 2017.
  • [11] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685, 2019.
  • [12] Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep neural networks. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [13] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. 2016.
  • [14] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pages 8527–8537, 2018.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [16] Wei Hu, Zhiyuan Li, and Dingli Yu. Understanding generalization of deep neural networks trained with noisy labels. arXiv preprint arXiv:1905.11368, 2019.
  • [17] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pages 2309–2318, 2018.
  • [18] Davood Karimi, Haoran Dou, Simon K Warfield, and Ali Gholipour. Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. arXiv preprint arXiv:1912.02911, 2019.
  • [19] Angelos Katharopoulos and Francois Fleuret. Not all samples are created equal: Deep learning with importance sampling. In International Conference on Machine Learning, pages 2525–2534, 2018.
  • [20] Rajiv Khanna, Ethan Elenberg, Alexandros Dimakis, Joydeep Ghosh, and Sahand Negahban. On approximation guarantees for greedy low rank optimization. In International Conference on Machine Learning, 2017.
  • [21] Ranjay A Krishna, Kenji Hata, Stephanie Chen, Joshua Kravitz, David A Shamma, Li Fei-Fei, and Michael S Bernstein. Embracing error to enable rapid crowdsourcing. In Proceedings of the 2016 CHI conference on human factors in computing systems, pages 3167–3179, 2016.
  • [22] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [23] Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. arXiv preprint arXiv:1903.11680, 2019.
  • [24] Wen Li, Limin Wang, Wei Li, Eirikur Agustsson, and Luc Van Gool. Webvision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862, 2017.
  • [25] Yuncheng Li, Jianchao Yang, Yale Song, Liangliang Cao, Jiebo Luo, and Li-Jia Li. Learning from noisy labels with distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1910–1918, 2017.
  • [26] Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In International Conference on Machine Learning, pages 3355–3364, 2018.
  • [27] Eran Malach and Shai Shalev-Shwartz. Decoupling "when to update" from "how to update". In Advances in Neural Information Processing Systems, pages 960–970, 2017.
  • [28] Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques, pages 234–243. Springer, 1978.
  • [29] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [30] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for data-efficient training of machine learning models. arXiv preprint arXiv:1906.01827, 2019.
  • [31] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pages 2049–2057, 2013.
  • [32] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 2020.
  • [33] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [34] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.
  • [35] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1):335–366, 2014.
  • [36] Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.
  • [37] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pages 4334–4343, 2018.
  • [38] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017.
  • [39] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence, 2017.
  • [40] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [41] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5552–5560, 2018.
  • [42] Brendan Van Rooyen, Aditya Menon, and Robert C Williamson. Learning with symmetric label noise: The importance of being unhinged. In Advances in Neural Information Processing Systems, pages 10–18, 2015.
  • [43] Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge Belongie. Learning from noisy large-scale datasets with minimal supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 839–847, 2017.
  • [44] Xinshao Wang, Yang Hua, Elyor Kodirov, and Neil M Robertson. Imae for noise-robust learning: Mean absolute error does not treat examples equally and gradient magnitude’s variance matters. arXiv preprint arXiv:1903.12141, 2019.
  • [45] Yuting Wei, Fanny Yang, and Martin J Wainwright. Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, pages 6065–6075, 2017.
  • [46] Xiaobo Xia, Tongliang Liu, Nannan Wang, Bo Han, Chen Gong, Gang Niu, and Masashi Sugiyama. Are anchor points really indispensable in label-noise learning? In Advances in Neural Information Processing Systems, pages 6838–6849, 2019.
  • [47] Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. L_DMI: A novel information-theoretic loss function for training deep nets robust to label noise. In Advances in Neural Information Processing Systems, pages 6225–6236, 2019.
  • [48] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • [49] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
  • [50] Han Zhang, Honglak Lee, Sercan Arik, Tomas Pfister, and Zizhao Zhang. Distilling effective supervision from severe label noise. 2020.
  • [51] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • [52] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in neural information processing systems, pages 8778–8788, 2018.

Appendix

Our main contribution is to show that for any dataset, clean data points cluster together in the gradient space, and hence medoids of the gradients (1) have clean labels, and (2) provide a low-rank approximation of the Jacobian, $\mathcal{J}$, of an arbitrary deep network. Hence, training on the medoids is robust to noisy labels. Our analysis of the residuals during gradient descent builds on the analysis of [23], but generalizes it to arbitrary deep networks without the need for over-parameterization or random initialization. Crucially, [23] relies on the following assumptions to show that, with high probability, gradient descent with early stopping is robust to label noise: (1) the data $\bm{X}\subset\mathbb{R}^{n\times d}$ has $K$ clusters, (2) the neural net $f$ has one hidden layer with $k$ neurons, i.e., $f=\phi(\bm{X}\bm{W}^{T})\bm{\nu}$, (3) the output weights $\bm{\nu}$ are fixed, half to $+1/\sqrt{k}$ and half to $-1/\sqrt{k}$, (4) the network is over-parameterized, i.e., $k\geq K^{4}$, where $K=\mathcal{O}(n)$, and (5) the input-to-hidden weights $\bm{W}^{0}$ have a random Gaussian initialization. Indeed, from the clusterable-data assumption it easily follows that the neural network covariance $\Sigma(\bm{X})=\mathbb{E}_{\bm{W}}\big[\big(\phi^{\prime}(\bm{X}\bm{W}^{T})\phi^{\prime}(\bm{W}\bm{X}^{T})\big)\odot\big(\bm{X}\bm{X}^{T}\big)\big]=\frac{1}{k}\mathbb{E}_{\bm{W}^{0}}\big[\mathcal{J}(\bm{W}^{0})\mathcal{J}^{T}(\bm{W}^{0})\big]$ is low-rank, and hence early stopping prevents overfitting the noisy labels. In contrast, our results hold for arbitrary deep networks without relying on the above assumptions.
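
To make the selection step concrete, the following minimal sketch (with hypothetical function and variable names; a simplified illustration, not the exact Crust implementation, which relies on upper bounds on gradient dissimilarities and submodular maximization) shows how weighted medoids could be extracted from per-example gradient vectors with a greedy k-medoids heuristic.

import numpy as np

def select_gradient_medoids(grads, k):
    """Greedy k-medoids over per-example gradient vectors (rows of `grads`).

    Returns the indices of the selected medoids and the weight of each medoid,
    i.e., the number of examples assigned to it by nearest-medoid assignment.
    """
    n = grads.shape[0]
    # Pairwise Euclidean distances in the gradient space.
    sq = np.sum(grads ** 2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * grads @ grads.T, 0.0))

    selected = []
    closest = np.full(n, dist.max() + 1.0)   # distance to the current medoid set
    for _ in range(k):
        # Pick the candidate that most reduces the total distance to the medoid set.
        gains = np.sum(closest[None, :] - np.minimum(closest[None, :], dist), axis=1)
        best = int(np.argmax(gains))
        selected.append(best)
        closest = np.minimum(closest, dist[best])

    # Weight each medoid by the size of the gradient-space cluster it represents.
    assignment = np.argmin(dist[selected], axis=0)
    weights = np.bincount(assignment, minlength=k)
    return np.array(selected), weights

# Example: 1000 examples with 50-dimensional (e.g., last-layer) gradients.
g = np.random.default_rng(0).standard_normal((1000, 50))
idx, w = select_gradient_medoids(g, k=20)
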

Appendix A Proofs for Theorems

The following Corollary builds upon the meta Theorem 7.2 of [23] and captures the contraction of the residuals during gradient descent updates on a dataset with corrupted labels, when the Jacobian is exactly low-rank. The original theorem in [23] is based on the assumptions that the fraction of corrupted labels in the dataset is small and that the dataset is clusterable. Hence, $\mathcal{S}_{+}$ is dictated by the membership of data points to different clusters. In contrast, our method Crust applies gradient descent only to the selected subsets. Note that the elements of the selected subsets are mostly clean, and hence the extracted subsets have a significantly smaller fraction of noisy labels. The elements selected earlier by Crust are medoids of the main clusters in the gradient space. As we keep selecting new data points, we extract medoids of the smaller groups of data points within the main clusters. Therefore, in our method the support subspace $\mathcal{S}_{+}$ is defined by the assignment of the elements of the extracted subsets to the main clusters.

More specifically, assume that there are $K<k$ main clusters in the gradient space. We denote the set of central elements of the main clusters by $\bar{S}$. For a subset $S\subseteq V$ of size $k$ selected by Crust and upper bounds $d^{u}$ on the gradient dissimilarities from Eq. (9), let $\Lambda_{\ell}=\{i\in[k]~|~\ell={\arg\min}_{s\in\bar{S}}d^{u}_{is}\}$ be the set of elements in $S$ that are closest to the element $\ell\in\bar{S}$, i.e., they lie within the main cluster $\ell$ in the gradient space. Then, $\mathcal{S}_{+}$ is characterized by

\mathcal{S}_{+}=\{\bm{v}\in\mathbb{R}^{k}~\big|~\bm{v}_{i_{1}}=\bm{v}_{i_{2}}\ \text{ for all }\ i_{1},i_{2}\in\Lambda_{\ell}\ \text{ and for all }\ 1\leq\ell\leq K\}. (12)
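
As a small illustration of this definition (a sketch with hypothetical variable names), given the upper bounds $d^{u}$ on the gradient dissimilarities, the sets $\Lambda_{\ell}$ and the orthogonal projection onto $\mathcal{S}_{+}$ can be computed by assigning each selected element to its closest central element and averaging within each block, since vectors in $\mathcal{S}_{+}$ are constant on each $\Lambda_{\ell}$.

import numpy as np

def project_onto_S_plus(d_upper, v):
    """Orthogonal projection of v in R^k onto S_+ from Eq. (12).

    d_upper: (k, K) matrix of upper bounds d^u_{is} on the gradient
             dissimilarity between selected element i and central element s.
    v:       vector of length k.
    Vectors in S_+ are constant on each Lambda_ell, so the projection
    replaces every entry by the mean of its block.
    """
    assignment = np.argmin(d_upper, axis=1)      # Lambda_ell membership of each element
    v = np.asarray(v, dtype=float)
    proj = np.empty(len(v), dtype=float)
    for ell in np.unique(assignment):
        block = assignment == ell
        proj[block] = v[block].mean()            # constant value on each Lambda_ell
    return proj
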

The following corollary captures the contraction of the residuals during gradient descent on subsets found by Crust, when the Jacobian is exactly low-rank. In Theorem 4.1, we characterize the contraction of the residuals when the Jacobian is approximately low-rank, i.e., over $\mathcal{S}_{-}$ the spectral norm is small but nonzero.

Corollary A.1

Consider a nonlinear least-squares problem of the form $\mathcal{L}(\bm{W},\bm{X})=\frac{1}{2}\|f(\bm{W},\bm{X})-\bm{y}\|_{\ell_{2}}^{2}$. Assume that the weighted Jacobian of the subset $S$ found by Crust is low-rank, i.e., for all $\bm{v}\in\mathcal{S}_{+}$ and $\bm{w}\in\mathcal{S}_{-}$ with unit Euclidean norm, and $\beta\geq\alpha>0$, we have $\alpha\leq\|\mathcal{J}_{r}^{T}(\bm{W},\bm{X}_{S})\bm{v}\|_{\ell_{2}}\leq\beta$ and $\|\mathcal{J}_{r}^{T}(\bm{W},\bm{X}_{S})\bm{w}\|_{\ell_{2}}=0$. Moreover, assume that the Jacobian mapping $\mathcal{J}(\bm{W},\bm{X}_{S})$ associated with the nonlinear mapping $f$ is $L$-smooth, i.e., for all $\bm{W}^{1},\bm{W}^{2}\in\mathbb{R}^{m}$ we have $\|\mathcal{J}(\bm{W}^{2})-\mathcal{J}(\bm{W}^{1})\|\leq L\|\bm{W}^{2}-\bm{W}^{1}\|_{\ell_{2}}$. (Note that if $\frac{\partial\mathcal{J}(\bm{W})}{\partial\bm{W}}$ is continuous, the smoothness condition holds over any compact domain, albeit for a possibly large $L$.) Also let $\bm{y}_{S},\widetilde{\bm{y}}_{S},\bm{e}=\bm{y}_{S}-\widetilde{\bm{y}}_{S}\in\mathbb{R}^{k}$ denote the corrupted labels, the uncorrupted labels, and the label corruption associated with the selected subset, respectively. Furthermore, suppose the initial residual $f(\bm{W}^{0},\bm{X}_{S})-\widetilde{\bm{y}}_{S}$ with respect to the uncorrupted labels obeys $f(\bm{W}^{0},\bm{X}_{S})-\widetilde{\bm{y}}_{S}\in\mathcal{S}_{+}$. Then, iterates of the gradient descent updates of the form (2) with a learning rate $\eta\leq\frac{1}{2\beta^{2}}\min\left(1,\frac{\alpha\beta}{L\|\bm{r}^{0}\|_{\ell_{2}}}\right)$ obey

\|\bm{W}^{\tau}-\bm{W}^{0}\|_{\ell_{2}}\leq\frac{4\|\bm{r}^{0}\|_{\ell_{2}}}{\alpha}=\frac{4\|f(\bm{W}^{0},\bm{X}_{S})-\bm{y}_{S}\|_{\ell_{2}}}{\alpha}.

Furthermore, if $\lambda>0$ is a precision level obeying $\lambda\geq\|\Pi_{\mathcal{S}_{+}}(\bm{e})\|_{\ell_{\infty}}$, then running the gradient descent updates of the form (2) with a learning rate $\eta\leq\frac{1}{2\beta^{2}}\min\left(1,\frac{\alpha\beta}{L\|\bm{r}^{0}\|_{\ell_{2}}}\right)$ for $\tau\geq\frac{5}{\eta\alpha^{2}}\log\left(\frac{\|\bm{r}^{0}\|_{\ell_{2}}}{\lambda}\right)$ iterations, $\bm{W}^{\tau}$ achieves the following error bound with respect to the true labels:

\|f(\bm{W}^{\tau},\bm{X}_{S})-\widetilde{\bm{y}}_{S}\|_{\ell_{\infty}}\leq 2\lambda.

Finally, if $\bm{e}$ has at most $s$ nonzeros and $\mathcal{S}_{+}$ is $\gamma$-diffused, i.e., for any vector $\bm{v}\in\mathcal{S}_{+}$ we have $\|\bm{v}\|_{\ell_{\infty}}\leq\sqrt{\gamma/n}\,\|\bm{v}\|_{\ell_{2}}$ for some $\gamma>0$, then using $\lambda=\|\Pi_{\mathcal{S}_{+}}(\bm{e})\|_{\ell_{\infty}}$ we get

\|f(\bm{W}^{\tau},\bm{X}_{S})-\widetilde{\bm{y}}_{S}\|_{\ell_{\infty}}\leq 2\|\Pi_{\mathcal{S}_{+}}(\bm{e})\|_{\ell_{\infty}}\leq\frac{\gamma\sqrt{s}}{n}\|\bm{e}\|_{\ell_{2}}.
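
To make the statement concrete, the following toy numerical check (a sketch under simplifying assumptions, not the general nonlinear setting of the corollary) uses a linear model, for which the Jacobian is constant and exactly low-rank and $\mathcal{S}_{+}$ is its column space; gradient descent then fits only the component of the corruption lying in $\mathcal{S}_{+}$, in line with the $2\|\Pi_{\mathcal{S}_{+}}(\bm{e})\|_{\ell_{\infty}}$ bound.

import numpy as np

rng = np.random.default_rng(0)
k, m, r = 60, 20, 4                      # examples, parameters, Jacobian rank

# Linear model f(W, X) = X @ W: the Jacobian is the constant rank-r matrix X,
# so S_+ is the column space of X and S_- is its orthogonal complement.
U = rng.standard_normal((k, r))
X = U @ rng.standard_normal((r, m))

y_clean = X @ rng.standard_normal(m)     # clean labels lie in S_+ = range(X)
e = np.zeros(k)
e[rng.choice(k, size=5, replace=False)] = rng.choice([-1.0, 1.0], size=5)
y = y_clean + e                          # sparse label corruption

eta = 0.9 / np.linalg.norm(X, 2) ** 2    # step size below 1/beta^2
W = np.zeros(m)
for _ in range(2000):
    W -= eta * X.T @ (X @ W - y)         # gradient descent on 0.5*||XW - y||^2

P_plus = U @ np.linalg.pinv(U)           # projector onto the column space of X
print(np.max(np.abs(X @ W - y_clean)))   # error w.r.t. the true labels ...
print(2 * np.max(np.abs(P_plus @ e)))    # ... stays below 2*||Pi_{S_+}(e)||_inf
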
Lemma A.2

Let $\{(\bm{x}_{i},y_{i})\}_{i\in S}$ be the subset selected by Crust, and $\{\tilde{y}_{i}\}_{i\in S}$ be the corresponding noiseless labels. Note that the elements of the selected subsets are mostly clean, but may contain a smaller fraction $\rho$ of noisy labels. Let $\mathcal{J}(\bm{W},\bm{X}_{S})$ be the Jacobian matrix corresponding to the selected subset, which is rank $k$, and let $\mathcal{S}_{+}$ be its column space. Then, the difference between the noiseless and noisy labels satisfies the bound

\|\Pi_{\mathcal{S}_{+}}(\bm{y}_{S}-\tilde{\bm{y}}_{S})\|_{\ell_{\infty}}\leq 2\rho.

Proof. The proof is similar to that of Lemma 8.10 in [23], but using $\mathcal{S}_{+}$ as defined in Eq. (12).

A.1 Proof of Theorem 4.1

We first consider the case where the set $S\subseteq V$ of $k$ weighted medoids found by Crust approximates the largest singular value of the neural network Jacobian with an error of $\epsilon=0$. In this case, we can apply Corollary A.1 to characterize the behavior of gradient descent on the weighted subsets found by Crust. The following corollary summarizes the result:

Corollary A.3

Assume that we apply gradient descent on the least-squares loss in Eq. (2) to train a neural network on the subsets found by Crust from a dataset with class labels $\{\nu_{j}\}_{j=1}^{C}\in[-1,1]$ and label margin $\delta$. Suppose that the Jacobian mapping is $L$-smooth, and let $\alpha=\min_{\tau}\sqrt{r_{\min}^{\tau}}\,\sigma_{\min}(\mathcal{J}(\bm{W}^{\tau},\bm{X}_{S^{\tau}}))$ and $\beta=\max_{\tau}\sqrt{r_{\max}^{\tau}}\,\|\mathcal{J}(\bm{W}^{\tau},\bm{X}_{S^{\tau}})\|$ be the minimum and maximum singular values of the Jacobians of the subsets weighted by $\bm{r}^{\tau}$ found by Crust during $\tau$ steps of gradient descent updates. If the subsets contain a fraction $\rho\leq\delta/8$ of noisy labels, then using step size $\eta=\frac{1}{2\beta^{2}}\min(1,\frac{\alpha\beta}{L\sqrt{2k}})$, after $\tau=\mathcal{O}\big(\frac{1}{\eta\alpha^{2}}\log(\frac{\sqrt{k}}{\rho})\big)$ iterations the neural network classifies all the selected elements correctly.
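
The step size and iteration count in Corollary A.3 can be computed directly from $\alpha$, $\beta$, $L$, $k$, and $\rho$; the short helper below (a sketch with illustrative constants, using the explicit threshold from Eq. (14) rather than the asymptotic form) spells out this arithmetic.

import numpy as np

def crust_schedule(alpha, beta, L, k, rho):
    """Step size and iteration count suggested by Corollary A.3 / Eq. (14).

    alpha, beta: min/max weighted singular values of the subset Jacobian,
    L: smoothness constant of the Jacobian mapping, k: subset size,
    rho: fraction of noisy labels in the subset (should satisfy rho <= delta/8).
    """
    eta = (1.0 / (2.0 * beta ** 2)) * min(1.0, alpha * beta / (L * np.sqrt(2.0 * k)))
    tau = int(np.ceil((5.0 / (eta * alpha ** 2)) * np.log(np.sqrt(2.0 * k) / (2.0 * rho))))
    return eta, tau

# Example with illustrative constants (not measured from a real network):
eta, tau = crust_schedule(alpha=0.5, beta=4.0, L=10.0, k=1000, rho=0.05)
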

Proof. Fix a vector $\bm{v}$ and let $\tilde{\bm{p}}=\mathcal{J}_{r}(\bm{W},\bm{X}_{S})\bm{v}$ and $\bm{p}=\mathcal{J}(\bm{W},\bm{X}_{S})\bm{v}$. The entries of $\tilde{\bm{p}}$ are the entries of $\bm{p}$ scaled by the weights, which lie between $r_{\min}$ and $r_{\max}$. This establishes upper and lower bounds on the singular values of $\mathcal{J}_{r}(\bm{W},\bm{X}_{S})$ over $\mathcal{S}_{+}$ in terms of the singular values of $\mathcal{J}(\bm{W},\bm{X}_{S})$, i.e.,

\sqrt{r_{\min}}\,\sigma_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}),\mathcal{S}_{+})\leq\sigma_{i\in[k]}(\mathcal{J}_{r}(\bm{W},\bm{X}_{S}),\mathcal{S}_{+})\leq\sqrt{r_{\max}}\,\|\mathcal{J}(\bm{W},\bm{X}_{S})\|. (13)

Therefore, we get $\alpha=\min_{\tau}\sqrt{r_{\min}^{\tau}}\,\sigma_{\min}(\mathcal{J}(\bm{W}^{\tau},\bm{X}_{S^{\tau}}))$ and $\beta=\max_{\tau}\sqrt{r_{\max}^{\tau}}\,\|\mathcal{J}(\bm{W}^{\tau},\bm{X}_{S^{\tau}})\|$.

Moreover, we have $f:\mathbb{R}^{d}\rightarrow[-1,1]$ and class labels $\{\nu_{j}\}_{j=1}^{C}\in[-1,1]$. Hence, for every element $i\in S$ we have $|f(\bm{W},\bm{x}_{i})-y_{i}|\leq 2$, and the initial misfit is upper-bounded by $\|\bm{r}^{0}\|_{\ell_{2}}\leq\sqrt{2k}$.

Now, using Lemma A.2, we know that

\|\Pi_{\mathcal{S}_{+}}(\bm{y}-\tilde{\bm{y}})\|_{\ell_{\infty}}\leq 2\rho.

Substituting the values corresponding to $\alpha,\beta$ into Corollary A.1, we get that after

\frac{5}{\eta\alpha^{2}}\log\Big(\frac{\|\bm{r}^{0}\|_{\ell_{2}}}{2\rho}\Big)\leq\frac{5}{\eta\alpha^{2}}\log\Big(\frac{\sqrt{2k}}{2\rho}\Big)\leq\tau (14)

gradient descent iterations, the error with respect to the true labels satisfies $\|f(\bm{W}^{\tau},\bm{X}_{S})-\widetilde{\bm{y}}_{S}\|_{\ell_{\infty}}\leq 2\lambda=4\rho$, where $\lambda=2\rho$ by Lemma A.2. If the gradient clusters are roughly balanced, i.e., there are $\mathcal{O}(k/K)$ data points in each cluster, we get that for all gradient descent iterations with

\frac{5}{\eta\alpha^{2}}\log\Big(\frac{\|\bm{r}^{0}\|_{\ell_{2}}}{2\rho}\Big)\leq\frac{5}{\eta\alpha^{2}}\log\Big(\frac{\sqrt{2k}}{2\rho}\Big)=\mathcal{O}\Big(\frac{K}{\eta k\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}\log\big(\frac{\sqrt{k}}{\rho}\big)\Big)\leq\tau, (15)

the infinity norm of the residual obeys (using $\lambda=\|\Pi_{\mathcal{S}_{+}}(\bm{e})\|_{\ell_{\infty}}\leq 2\rho$)

\|f(\bm{W},\bm{X}_{S})-\tilde{\bm{y}}\|_{\ell_{\infty}}\leq 4\rho.

This implies that if $\rho\leq\delta/8$, the labels predicted by the network deviate from the correct labels by less than $\delta/2$; hence the elements of the subsets (including those with noisy labels) are classified correctly.

A.2 Completing the Proof of Theorem 4.1

Next, we consider the case where the weighted subsets found by Crust approximate the largest singular value of the Jacobian matrix with an error of at most $\epsilon$. Here, we characterize the behavior of gradient descent by comparing the iterations with and without this error.

In particular, starting from $\bm{W}^{0}=\bar{\bm{W}}^{0}$, consider the gradient descent iterations on the weighted subsets $\bar{S}$ that estimate the largest singular value of the neural network Jacobian without any error,

\bar{\bm{W}}^{\tau+1}=\bar{\bm{W}}^{\tau}-\eta\mathcal{J}_{r}^{T}(\bar{\bm{W}}^{\tau},\bm{X}_{\bar{S}^{\tau}})\big(f(\bar{\bm{W}}^{\tau},\bm{X}_{\bar{S}^{\tau}})-\bm{y}_{\bar{S}^{\tau}}\big), (16)

and the gradient descent iterations on the weighted subsets $S$ with an error of at most $\epsilon$ in estimating the largest singular value of the neural network Jacobian,

\bm{W}^{\tau+1}=\bm{W}^{\tau}-\eta\mathcal{J}_{r}^{T}(\bm{W}^{\tau},\bm{X}_{S^{\tau}})\big(f(\bm{W}^{\tau},\bm{X}_{S^{\tau}})-\bm{y}_{S^{\tau}}\big). (17)

To proceed with the proof, we use the following shorthand notation for the residuals and Jacobian matrices at iteration $\tau$ of gradient descent:

\bm{r}^{\tau}=f(\bm{W}^{\tau},\bm{X}_{S^{\tau}})-\bm{y}_{S^{\tau}},\qquad\bar{\bm{r}}^{\tau}=f(\bar{\bm{W}}^{\tau},\bm{X}_{\bar{S}^{\tau}})-\bm{y}_{\bar{S}^{\tau}}, (18)
\mathcal{J}^{\tau}=\mathcal{J}_{r}(\bm{W}^{\tau},\bm{X}_{S^{\tau}}),\qquad\mathcal{J}^{\tau+1,\tau}=\mathcal{J}_{r}(\bm{W}^{\tau+1},\bm{W}^{\tau},\bm{X}_{S^{\tau}}), (19)
\bar{\mathcal{J}}^{\tau}=\mathcal{J}_{r}(\bar{\bm{W}}^{\tau},\bm{X}_{\bar{S}^{\tau}}),\qquad\bar{\mathcal{J}}^{\tau+1,\tau}=\mathcal{J}_{r}(\bar{\bm{W}}^{\tau+1},\bar{\bm{W}}^{\tau},\bm{X}_{\bar{S}^{\tau}}), (20)
d^{\tau}=\|\bm{W}^{\tau}-\bar{\bm{W}}^{\tau}\|_{F},\qquad p^{\tau}=\|\bm{r}^{\tau}-\bar{\bm{r}}^{\tau}\|_{F}, (21)

where $\mathcal{J}_{r}(\bm{W}^{1},\bm{W}^{2},\bm{X}_{S})$ denotes the average weighted neural network Jacobian on the subset $\bm{X}_{S}$, i.e.,

\mathcal{J}_{r}(\bm{W}^{1},\bm{W}^{2},\bm{X}_{S})=\int_{0}^{1}\mathcal{J}_{r}\big(\alpha\bm{W}^{1}+(1-\alpha)\bm{W}^{2},\bm{X}_{S}\big)\,d\alpha.
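
Numerically, this path-averaged Jacobian can be approximated by a simple quadrature; the sketch below (assuming a hypothetical callable jac_fn that returns the weighted Jacobian $\mathcal{J}_{r}(\bm{W},\bm{X}_{S})$ at given parameters) uses the midpoint rule.

import numpy as np

def averaged_jacobian(jac_fn, W1, W2, num_points=16):
    """Midpoint-rule approximation of the path-averaged Jacobian
    J_r(W1, W2, X_S) = int_0^1 J_r(a*W1 + (1-a)*W2, X_S) da.

    `jac_fn(W)` is assumed to return the weighted Jacobian J_r(W, X_S)
    at parameters W as a NumPy array.
    """
    alphas = (np.arange(num_points) + 0.5) / num_points   # midpoints of [0, 1]
    total = sum(jac_fn(a * W1 + (1.0 - a) * W2) for a in alphas)
    return total / num_points
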

We first prove the following lemma, which bounds the norm of the difference between $\mathcal{J}_{r}(\bar{\bm{W}}^{1},\bar{\bm{W}}^{2},\bm{X}_{\bar{S}})$ and $\mathcal{J}_{r}(\bm{W}^{1},\bm{W}^{2},\bm{X}_{S})$.

Lemma A.4

Let $\bm{X}_{\bar{S}}$ and $\bm{X}_{S}$ be the subsets of data points found by Crust that approximate the Jacobian matrix on the entire dataset with an error of $0$ and at most $\epsilon$, respectively. Given parameters $\bm{W}^{1},\bm{W}^{2},\bar{\bm{W}}^{1},\bar{\bm{W}}^{2}$, we have

\|\mathcal{J}_{r}(\bm{W}^{1},\bm{W}^{2},\bm{X}_{S})-\mathcal{J}_{r}(\bar{\bm{W}}^{1},\bar{\bm{W}}^{2},\bm{X}_{\bar{S}})\|\leq\frac{L\big(\|\bar{\bm{W}}^{1}-\bm{W}^{1}\|_{F}+\|\bar{\bm{W}}^{2}-\bm{W}^{2}\|_{F}\big)}{2}+\epsilon.

Proof. Given $\bm{W},\bar{\bm{W}}$, we can write

\|\mathcal{J}_{r}(\bm{W},\bm{X}_{S})-\mathcal{J}_{r}(\bar{\bm{W}},\bm{X}_{\bar{S}})\|\leq\|\mathcal{J}_{r}(\bm{W},\bm{X}_{S})-\mathcal{J}_{r}(\bar{\bm{W}},\bm{X}_{S})\|+\|\mathcal{J}_{r}(\bar{\bm{W}},\bm{X}_{\bar{S}})-\mathcal{J}_{r}(\bar{\bm{W}},\bm{X}_{S})\| (22)
\leq L\|\bm{W}-\bar{\bm{W}}\|+\epsilon. (23)

To get the result on $\|\mathcal{J}_{r}(\bm{W}^{1},\bm{W}^{2},\bm{X}_{S})-\mathcal{J}_{r}(\bar{\bm{W}}^{1},\bar{\bm{W}}^{2},\bm{X}_{\bar{S}})\|$, we integrate:

\|\mathcal{J}_{r}(\bm{W}^{1},\bm{W}^{2},\bm{X}_{S})-\mathcal{J}_{r}(\bar{\bm{W}}^{1},\bar{\bm{W}}^{2},\bm{X}_{\bar{S}})\|\leq\int_{0}^{1}\big(L\|\alpha(\bar{\bm{W}}^{1}-\bm{W}^{1})+(1-\alpha)(\bar{\bm{W}}^{2}-\bm{W}^{2})\|_{F}+\epsilon\big)\,d\alpha (24)
\leq\frac{L\big(\|\bar{\bm{W}}^{1}-\bm{W}^{1}\|_{F}+\|\bar{\bm{W}}^{2}-\bm{W}^{2}\|_{F}\big)}{2}+\epsilon. (25)

 

Now, applying Lemma A.4, we have

\|\mathcal{J}_{r}(\bm{W}^{\tau},\bm{X}_{S^{\tau}})-\mathcal{J}_{r}(\bar{\bm{W}}^{\tau},\bm{X}_{\bar{S}^{\tau}})\|\leq L\|\bar{\bm{W}}^{\tau}-\bm{W}^{\tau}\|_{F}+\epsilon\leq Ld^{\tau}+\epsilon, (26)
\|\mathcal{J}_{r}(\bm{W}^{\tau+1},\bm{W}^{\tau},\bm{X}_{S^{\tau}})-\mathcal{J}_{r}(\bar{\bm{W}}^{\tau+1},\bar{\bm{W}}^{\tau},\bm{X}_{\bar{S}^{\tau}})\|\leq L(d^{\tau}+d^{\tau+1})/2+\epsilon. (27)

Using these bounds, and since the norm of the noiseless residual is non-increasing and satisfies $\|\bar{\bm{r}}^{\tau}\|_{\ell_{2}}\leq\|\bar{\bm{r}}^{0}\|_{\ell_{2}}$, we can write

\bm{W}^{\tau+1}=\bm{W}^{\tau}-\eta(\mathcal{J}^{\tau})^{T}\bm{r}^{\tau},\qquad\bar{\bm{W}}^{\tau+1}=\bar{\bm{W}}^{\tau}-\eta(\bar{\mathcal{J}}^{\tau})^{T}\bar{\bm{r}}^{\tau}, (28)
\|\bm{W}^{\tau+1}-\bar{\bm{W}}^{\tau+1}\|_{F}\leq\|\bm{W}^{\tau}-\bar{\bm{W}}^{\tau}\|_{F}+\eta\|\mathcal{J}^{\tau}-\bar{\mathcal{J}}^{\tau}\|\,\|\bar{\bm{r}}^{\tau}\|_{\ell_{2}}+\eta\|\mathcal{J}^{\tau}\|\,\|\bm{r}^{\tau}-\bar{\bm{r}}^{\tau}\|_{\ell_{2}}, (29)
d^{\tau+1}\leq d^{\tau}+\eta\big((Ld^{\tau}+\epsilon)\|\bar{\bm{r}}^{0}\|_{\ell_{2}}+\beta p^{\tau}\big). (30)

For the residual we have

\bm{r}^{\tau+1}=\bm{r}^{\tau}-f(\bm{W}^{\tau},\bm{X}_{S})+f(\bm{W}^{\tau+1},\bm{X}_{S}) (31)
=\bm{r}^{\tau}+\mathcal{J}^{\tau+1,\tau}(\bm{W}^{\tau+1}-\bm{W}^{\tau}) (32)
=\bm{r}^{\tau}-\eta\mathcal{J}^{\tau+1,\tau}(\mathcal{J}^{\tau})^{T}\bm{r}^{\tau}, (33)

where in Eq. (33) we used $\bm{W}^{\tau+1}-\bm{W}^{\tau}=-\eta\nabla\mathcal{L}(\bm{W}^{\tau})=-\eta(\mathcal{J}^{\tau})^{T}\bm{r}^{\tau}$. Furthermore, we can write

\bm{r}^{\tau+1}-\bar{\bm{r}}^{\tau+1}=(\bm{r}^{\tau}-\bar{\bm{r}}^{\tau})-\eta(\mathcal{J}^{\tau+1,\tau}-\bar{\mathcal{J}}^{\tau+1,\tau})(\mathcal{J}^{\tau})^{T}\bm{r}^{\tau} (34)
\quad-\eta\bar{\mathcal{J}}^{\tau+1,\tau}\big((\mathcal{J}^{\tau})^{T}-(\bar{\mathcal{J}}^{\tau})^{T}\big)\bm{r}^{\tau}-\eta\bar{\mathcal{J}}^{\tau+1,\tau}(\bar{\mathcal{J}}^{\tau})^{T}(\bm{r}^{\tau}-\bar{\bm{r}}^{\tau}) (35)
=\big(\bm{I}-\eta\bar{\mathcal{J}}^{\tau+1,\tau}(\bar{\mathcal{J}}^{\tau})^{T}\big)(\bm{r}^{\tau}-\bar{\bm{r}}^{\tau})-\eta(\mathcal{J}^{\tau+1,\tau}-\bar{\mathcal{J}}^{\tau+1,\tau})(\mathcal{J}^{\tau})^{T}\bm{r}^{\tau} (36)
\quad-\eta\bar{\mathcal{J}}^{\tau+1,\tau}\big((\mathcal{J}^{\tau})^{T}-(\bar{\mathcal{J}}^{\tau})^{T}\big)\bm{r}^{\tau}. (37)

Using $\bm{I}\succeq\bar{\mathcal{J}}^{\tau+1,\tau}(\bar{\mathcal{J}}^{\tau})^{T}/\beta^{2}\succeq 0$, we have

\|\bm{r}^{\tau+1}-\bar{\bm{r}}^{\tau+1}\|_{\ell_{2}}\leq\|\bm{r}^{\tau}-\bar{\bm{r}}^{\tau}\|_{\ell_{2}}+\eta\beta\|\bm{r}^{\tau}\|_{\ell_{2}}\big(L(3d^{\tau}+d^{\tau+1})/2+2\epsilon\big) (38)
\leq\|\bm{r}^{\tau}-\bar{\bm{r}}^{\tau}\|_{\ell_{2}}+\eta\beta\big(\|\bar{\bm{r}}^{0}\|_{\ell_{2}}+p^{\tau}\big)\big(L(3d^{\tau}+d^{\tau+1})/2+2\epsilon\big), (39)

where we used $\|\bm{r}^{\tau}\|_{\ell_{2}}\leq p^{\tau}+\|\bar{\bm{r}}^{0}\|_{\ell_{2}}$ and $\|(\bm{I}-\eta\bar{\mathcal{J}}^{\tau+1,\tau}(\bar{\mathcal{J}}^{\tau})^{T})\bm{v}\|_{\ell_{2}}\leq\|\bm{v}\|_{\ell_{2}}$, which follows from the contraction of the residual. This implies that

p^{\tau+1}\leq p^{\tau}+\eta\beta\big(\|\bar{\bm{r}}^{0}\|_{\ell_{2}}+p^{\tau}\big)\big(L(3d^{\tau}+d^{\tau+1})/2+2\epsilon\big). (40)
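
Before giving the formal induction, a purely numerical sanity check of the coupled recursions (30) and (40) can be helpful; the sketch below iterates the upper bounds with illustrative constants (chosen so that $L$ satisfies the condition in (41) below) and compares the result with the bounds on $d^{t}$ and $p^{t}$ established by the induction.

import numpy as np

# Iterate the upper-bound recursions (30) and (40); T plays the role of tau_0.
eta, beta, L, eps, Theta, T = 1e-2, 2.0, 1e-4, 1e-3, 10.0, 200

d, p = 0.0, 0.0
for t in range(T):
    d_next = d + eta * ((L * d + eps) * Theta + beta * p)                          # Eq. (30)
    p_next = p + eta * beta * (Theta + p) * (L * (3 * d + d_next) / 2 + 2 * eps)   # Eq. (40)
    d, p = d_next, p_next

print(d, 2 * T * eta * eps * Theta * (1 + 8 * eta * T * beta ** 2))   # d^T vs. its claimed bound
print(p, 8 * T * eta * eps * Theta * beta)                            # p^T vs. its claimed bound
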

We use an inductive argument similar to that of Theorem 8.10 in [23]. The claim is that if (using $\|\bar{\bm{r}}^{0}\|_{\ell_{2}}\leq\Theta$)

\epsilon\leq\mathcal{O}\Big(\frac{\alpha^{2}}{\beta\log(\sqrt{k}/\rho)}\Big)\leq\mathcal{O}\Big(\frac{k\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}{K\beta\log(\sqrt{k}/\rho)}\Big),\quad\text{and}\quad L\leq\frac{2}{5\tau_{0}\eta\Theta(1+8\eta\tau_{0}\beta^{2})}, (41)

then for all $t\leq\tau_{0}$ it follows that

p^{t}\leq 8t\eta\epsilon\Theta\beta,\qquad d^{t}\leq 2t\eta\epsilon\Theta(1+8\eta\tau_{0}\beta^{2})\leq 20t\eta^{2}\tau_{0}\epsilon\Theta\beta^{2}\leq\mathcal{O}(t\eta^{2}\tau_{0}k^{3/2}\epsilon). (42)

The proof is by induction. Suppose that for $t\leq\tau_{0}-1$ we have

p^{t}\leq 8t\eta\epsilon\Theta\beta\leq\Theta,\qquad d^{t}\leq 2t\eta\epsilon\Theta(1+8\eta\tau_{0}\beta^{2}). (43)

At step $t+1$, from (30) and the bound on $p^{t}$ in (43) we know that

\frac{d^{t+1}-d^{t}}{\eta}\leq Ld^{t}\Theta+\epsilon\Theta+8\tau_{0}\eta\beta^{2}\epsilon\Theta. (44)

Now, using $L\leq\frac{2}{5\tau_{0}\eta\Theta(1+8\eta\tau_{0}\beta^{2})}\leq\frac{1}{2\eta\tau_{0}\Theta}$ from (41), and substituting $d^{t}$ from (43) into (44), we get

\frac{d^{t+1}-d^{t}}{\eta}\leq Ld^{t}\Theta+\epsilon\Theta+8\tau_{0}\eta\beta^{2}\epsilon\Theta\overset{?}{\leq}2\epsilon\Theta(1+8\eta\tau_{0}\beta^{2}). (45)

This establishes the induction for $d^{t+1}$.

To show the induction on $p^{t}$, following Eq. (8.64) of [23] and using $p^{t}\leq\Theta$, we need

\frac{p^{t+1}-p^{t}}{\eta}\leq\beta\Theta\big(L(3d^{t}+d^{t+1})+4\epsilon\big)\overset{?}{\leq}8\epsilon\Theta\beta (46)
L(3d^{t}+d^{t+1})+4\epsilon\overset{?}{\leq}8\epsilon (47)
L(3d^{t}+d^{t+1})\overset{?}{\leq}4\epsilon (48)
10L\tau_{0}\eta(1+8\eta\tau_{0}\beta^{2})\Theta\overset{?}{\leq}4, (49)

where in the last inequality we used $3d^{t}+d^{t+1}\leq 10\tau_{0}\eta\epsilon\Theta(1+8\eta\tau_{0}\beta^{2})$. Note that $\eta=\frac{1}{2\beta^{2}}\min(1,\frac{\alpha\beta}{L\sqrt{2k}})$. Hence, if $\frac{\alpha\beta}{L\sqrt{2k}}\geq 1$, we get $\eta=\frac{1}{2\beta^{2}}\geq\frac{1}{\tau_{0}\beta^{2}}$, and thus $\eta\tau_{0}\beta^{2}\geq 1$. This allows upper-bounding $3d^{t}+d^{t+1}$. Now, for $L$ we have

L\leq\frac{2}{5\tau_{0}\eta\Theta(1+8\eta\tau_{0}\beta^{2})}. (50)

This concludes the induction since the condition on LL is satisfied.

Now, from (44) and using $\eta\tau_{0}\beta^{2}\geq 1$, we get

d^{t}\leq 2t\eta\epsilon\Theta(1+8\eta\tau_{0}\beta^{2})\leq\mathcal{O}(t\eta^{2}\tau_{0}\sqrt{k}\epsilon). (51)

Finally, from (43) we have $p^{t}\leq 8t\eta\epsilon\Theta\beta\leq\Theta$. Using $\eta\tau_{0}=\mathcal{O}\big(\frac{K}{k\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}\log(\frac{\sqrt{k}}{\rho})\big)$ and noting that $\alpha\leq\beta$, we have that for $\tau\geq t$

\epsilon\leq\frac{1}{8\tau_{0}\eta\beta}\leq\mathcal{O}\Big(\frac{\alpha^{2}}{\beta\log(\sqrt{k}/\rho)}\Big)\leq\mathcal{O}\Big(\frac{k\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}{K\beta\log(\sqrt{k}/\rho)}\Big).

Now we calculate the misclassification error. From Corollary A.3 and Eq. (43) we have that for $\eta=\frac{1}{2\beta^{2}}\min(1,\frac{\alpha\beta}{L\sqrt{2k}})$ and $\Theta=\sqrt{2k}$, after $\tau=\mathcal{O}\big(\frac{K}{\eta k\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}\log(\frac{\sqrt{k}}{\rho})\big)$ iterations we get

\|\bar{\bm{r}}^{\tau}\|_{\ell_{\infty}}\leq 4\rho\quad\text{and} (52)
p^{\tau}=\|\bm{r}^{\tau}-\bar{\bm{r}}^{\tau}\|_{\ell_{2}}\leq c\epsilon\frac{K\beta}{\sqrt{k}\,\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}\log\big(\frac{\sqrt{k}}{\rho}\big). (53)

To calculate the classification rate, we redefine the residual vectors with respect to the true labels as $\bar{\bm{r}}^{\tau}=f(\bar{\bm{W}}^{\tau},\bm{X}_{\bar{S}^{\tau}})-\tilde{\bm{y}}_{\bar{S}^{\tau}}$ and $\bm{r}^{\tau}=f(\bm{W}^{\tau},\bm{X}_{S^{\tau}})-\tilde{\bm{y}}_{S^{\tau}}$. Now, we count the number of entries of $\bm{r}^{\tau}$ that are larger than $\delta/2$, half the label margin, in absolute value. Let $\mathcal{I}$ be the set of entries satisfying this condition. For $i\in\mathcal{I}$ we have $|r^{\tau}_{i}|\geq\delta/2$. Therefore,

|\bar{r}^{\tau}_{i}|+|r^{\tau}_{i}-\bar{r}^{\tau}_{i}|\geq|r^{\tau}_{i}+\bar{r}^{\tau}_{i}-\bar{r}^{\tau}_{i}|\geq\delta/2. (54)

Since $\rho=(1-\gamma)\delta/8<\delta/8$ for $0<\gamma\ll 1$, we get $\|\bar{\bm{r}}^{\tau}\|_{\ell_{\infty}}\leq 4\rho\leq(1-\gamma)\delta/2$ and

|r^{\tau}_{i}-\bar{r}^{\tau}_{i}|\geq\gamma\delta/2. (55)

Thus, summing over the entries in $\mathcal{I}$, i.e., those whose error exceeds half the label margin, gives

\|\bm{r}^{\tau}-\bar{\bm{r}}^{\tau}\|_{\ell_{1}}\geq|\mathcal{I}|\gamma\delta/2. (56)

Consequently, we have

|\mathcal{I}|\gamma\delta/2\leq\|\bm{r}^{\tau}-\bar{\bm{r}}^{\tau}\|_{\ell_{1}}\leq\sqrt{k}\,\|\bm{r}^{\tau}-\bar{\bm{r}}^{\tau}\|_{\ell_{2}}\leq c\epsilon\frac{K\beta}{\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}\log\big(\frac{\sqrt{k}}{\rho}\big). (57)

Hence, the total number of errors is at most

|\mathcal{I}|\leq c^{\prime}\epsilon\frac{K\beta}{\gamma\delta\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}\log\big(\frac{\sqrt{k}}{\rho}\big)=\mathcal{O}\Big(\frac{\epsilon k\beta}{\gamma\delta\alpha^{2}}\log\big(\frac{\sqrt{k}}{\rho}\big)\Big). (58)

For the network to classify all the data points correctly, we need $|\mathcal{I}|\leq c^{\prime}\epsilon\frac{K\beta}{\gamma\delta\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}\log(\frac{\sqrt{k}}{\rho})<1$. Hence, we get

\epsilon<\frac{c_{0}\gamma\delta\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}{K\beta\log(\frac{\sqrt{k}}{\rho})}<\frac{c_{1}\delta\sigma^{2}_{\min}(\mathcal{J}(\bm{W},\bm{X}_{S}))}{K\beta\log(\frac{\sqrt{k}}{\rho})}. (59)