Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime
Abstract
We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient over-parameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk.
1 Introduction
A surprising statistical phenomenon has emerged in modern machine learning: highly complex models can interpolate training data while still generalizing well to test data, even in the presence of label noise. This is striking because it goes against the grain of classical statistical wisdom, which dictates that predictors that generalize well should trade off the fit to the training data against some measure of the complexity or smoothness of the predictor. Many estimators, including neural networks, kernel estimators, nearest neighbor estimators, and even linear models, have been shown to exhibit this phenomenon [see, e.g., Zha+17, Bel+19].
This phenomenon has recently inspired intense theoretical research. One line of work [Sou+18, JT19, Gun+17, NSS19, Gun+18a, Gun+18] formalized the argument [NTS14, Ney17] that, even when no explicit regularization is used in training these rich models, there is nevertheless implicit regularization encoded in the choice of the optimization method. For example, in the setting of linear classification, [Sou+18], [JT19] and [NSS19] show that learning a linear classifier using gradient descent on the unregularized logistic or exponential loss asymptotically leads the solution to converge to the maximum $\ell_2$-margin classifier. More concretely, given linearly separable samples $(x_1, y_1), \dots, (x_n, y_n)$, where the $x_i \in \mathbb{R}^d$ are the features and the $y_i \in \{-1, +1\}$ are the labels, the iterates of gradient descent (initialized at the origin) are given by $w_{t+1} = w_t - \alpha \nabla \widehat{L}(w_t)$, where $\widehat{L}$ denotes the empirical loss and $\alpha > 0$ is the step size.
They show that, in the large-$t$ limit, the normalized predictor $w_t / \|w_t\|_2$ obtained by gradient descent converges to $\hat{w} / \|\hat{w}\|_2$, where
$$\hat{w} \;=\; \operatorname*{argmin}_{w \in \mathbb{R}^d} \|w\|_2 \quad \text{such that} \quad y_i \langle w, x_i \rangle \ge 1 \ \text{ for all } i \in \{1, \dots, n\}. \tag{1}$$
That is, $\hat{w}$ is the maximum $\ell_2$-margin classifier over the training data.
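To make the convergence to the maximum margin direction concrete, here is a small numerical sketch. It is our own illustration rather than code from the paper: the synthetic data, the use of a soft-margin SVM with a very large cost parameter as a stand-in for the hard-margin solution, and all variable names are our choices.

```python
# Gradient descent on the unregularized logistic loss versus the max-margin
# direction on linearly separable data (illustration only; data model is ours).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d = 40, 200                              # overparameterized: d > n
y = rng.choice([-1.0, 1.0], size=n)
mu = np.zeros(d)
mu[:10] = 0.5                               # a few weakly informative coordinates
X = rng.normal(size=(n, d)) + np.outer(y, mu)

def grad_logistic(w):
    # gradient of (1/n) * sum_i log(1 + exp(-y_i <w, x_i>))
    margins = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / n

w = np.zeros(d)
for _ in range(20_000):                     # long training drives the loss toward 0
    w -= 0.5 * grad_logistic(w)

# Hard-margin SVM direction, approximated by a soft-margin SVM with a huge C.
w_svm = SVC(kernel="linear", C=1e6).fit(X, y).coef_.ravel()
cos = w @ w_svm / (np.linalg.norm(w) * np.linalg.norm(w_svm))
print(f"cosine between the two directions: {cos:.3f}")  # approaches 1 with more steps
```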
The question still remains, though: why do these maximum margin classifiers generalize well beyond the training set, despite the fact that they “fit the noise”? The fact that $d > n$ renders traditional distribution-free bounds [Cov65, Vap82] vacuous. Due to the presence of label noise, margin bounds [Vap95, Sha+98] are also not an obvious answer.

In this paper, we prove an upper bound on the misclassification test error for the maximum margin linear classifier, and therefore on the limit of gradient descent on the training error without any complexity penalty. Our analysis holds under a natural and fairly general generative model for the data. One special case is where adversarial label noise [see, e.g., KSS94, Kal+08, KLS09, ABL17, Tal20] is added to data in which the positive examples are distributed as $N(\mu, I_d)$ and the negative examples are distributed as $N(-\mu, I_d)$. If $\|\mu\|$ is not too small, the clean data will consist of overlapping but largely separate clouds of points. Our assumptions are weaker than this, however (see Section 2 for the details). They are satisfied by the case in which misclassification noise is added to the generative model underlying Fisher’s linear discriminant [see, e.g., DHS12, HTF09] (except that, to make the analysis cleaner, the distribution is shifted so that the origin is halfway between the class-conditional means). They also include as special cases the rare-weak model [DJ08, Jin09] and a Boolean variant [HL12]. We study the overparameterized regime, in which the dimension is significantly greater than the number of samples. For a precise statement of our main result see Theorem 4. After its statement, we give examples of its consequences, including cases in which only some of the variables are relevant, and these are only weakly associated with the class designations. In some cases where the number of samples, the dimension, and the number of relevant variables are polynomially related, the risk of the maximum-margin algorithm approaches the Bayes-optimal risk as the number of samples grows.
Analysis of classification is hindered by the fact that, in contrast to regression, there is no known simple expression for the parameters as a function of the training data. Our analysis leverages recent results, mentioned above, that characterize the weight vector obtained by minimizing the logistic loss on training data [Sou+18, JT19, NSS19]. We use this result not only to motivate the analysis of the maximum margin algorithm, but also in our proofs, to get a handle on the relationship of this solution to the training data. When learning in the presence of label noise, algorithms that minimize a convex loss face the hazard that mislabeled examples can exert an outsized influence. However, we show that in the over-parameterized regime this effect is ameliorated. In particular, we show that the ratio between the (exponential) losses of any two examples is bounded above by an absolute constant. One special case of our upper bounds is where there are relatively few relevant variables, and many irrelevant variables. In this case, classification using only parameters that correctly classify the clean examples with a large margin leads to large loss on examples with noisy class labels. However, the training process can use the parameters on irrelevant variables to play a role akin to slack variables, allowing the algorithm to correctly classify incorrectly labeled training examples with limited harm on independent test data. On the other hand, if there are too many irrelevant variables, accurate classification is impossible [Jin09]. Our bounds reflect this reality—if the number of irrelevant variables increases while the number and quality of the relevant variables remains fixed, ultimately our bounds degrade.
In simulation experiments, we see a decrease in population risk as the dimension grows beyond the number of samples, as observed in previous double-descent papers, but this is followed by an increase. As mentioned above, [Jin09] showed that, under certain conditions relating the number of attributes to the number of relevant attributes, learning is, in a sense, impossible. Our experiments suggest that interpolation with logistic loss can succeed close to this boundary, despite the lack of explicit regularization or feature selection.
1.1 Related Work
A number of recent papers have focused on bounding the asymptotic error of overparameterized decision rules. [Has+19] and [Mut+20a] studied the asymptotic squared error of the interpolating ordinary least squares estimator for the problem of overparameterized linear regression. This was followed by [MM19] who characterize the asymptotic error of the OLS estimator in the random features model. As we do, [Mon+19] studied linear classification, calculating a formula for the asymptotic test error of the maximum margin classifier in the overparameterized setting when the features are generated from a Gaussian distribution and the labels are generated from a logistic link function. This was followed by the work of [LS20] who calculated a formula for the asymptotic test error of the maximum $\ell_1$-margin classifier in the same setting. [DKT20] independently obtained related results, including analysis of the case where the marginal distribution over the covariates is a mixture of Gaussians, one for each class. Previously, [CS20, SC19] studied the asymptotic test error for this problem in the underparameterized regime. In contrast with this previous work, we provide finite-sample bounds.
There has also been quite a bit of work on the non-asymptotic analysis of interpolating estimators. [LR20] provided a finite-sample upper bound on the expected squared error of a kernel “ridgeless” regressor, which interpolates the training data. [KLS20] provided an analysis of linear regression that emphasized the role of irrelevant variables as providing placeholders for learning parameters that play the role of slack variables. [BHX20] provided a finite-sample analysis of interpolating least-norm regression with feature selection. They showed that, beyond the point where the number of features included in the model exceeds the number of training examples, the excess risk decreases with the number of included features. This analysis considered the case that the covariates have a standard normal distribution. They also obtained similar results for a “random features” model. [Bar+20] provided non-asymptotic upper and lower bounds on the squared error for the OLS estimator; their analysis emphasized the effect of the covariance structure of the independent variables on the success or failure of this estimator. This earlier work studied regression; here we consider classification. Study of regression is facilitated by the fact that the OLS parameter vector has a simple closed-form expression as a function of the training data. [BHM18] studied the generalization error of a simplicial interpolating nearest neighbor rule. [BRT19] provided bounds on the generalization error of the Nadaraya-Watson regression estimator applied with a singular kernel, a method that interpolates the training data. [LRZ20] provided upper bounds on the population risk of the least-norm interpolant applied to a class of kernels including the Neural Tangent Kernel.
In concurrent independent work, [Mut+20] studied the generalization properties of the maximum margin classifier in the case where the marginal distribution on the covariates is a single Gaussian, rather than a Gaussian per class. They showed that, in this setting, if there is enough overparameterization, every example is a support vector, so that the maximum margin algorithm outputs the same parameters as the OLS algorithm. They also showed that the accuracy of the model, measured using the 0-1 test loss, can be much better than its accuracy with respect to the quadratic loss.
Additional related work is described in Section 6.
2 Definitions, Notation and Assumptions
Throughout this section, $c$ and $C$ denote absolute constants. We will show that any choice of $C$ that is large enough relative to $c$ will work.
We study learning from independent random examples sampled from a joint distribution . This distribution may be viewed as a noisy variant of another distribution which we now describe. A sample from may be generated by the following process.
1. First, a clean label is generated by flipping a fair coin.
2. Next, is sampled from , which is an arbitrary product distribution over
   - whose marginals are all zero-mean sub-Gaussians with sub-Gaussian norm at most (see Definition 15), and
   - such that .
3. For an arbitrary unitary matrix and , . This ensures that the mean of is when and is when .
4. Finally, noise is modeled as follows. For , is an arbitrary distribution over
   - whose marginal distribution on is the same as , and
   - such that .
   Note that this definition includes the special case where is obtained from by flipping it with probability .
Choosing a bound of on the sub-Gaussian norm of the components of fixes the scale of the data. This simplifies the proofs without materially affecting the analysis, since rescaling the data does not affect the accuracy of the maximum margin algorithm.
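The following sketch shows one concrete way to draw samples consistent with this generative process. It is our own illustration: we instantiate the latent product distribution as a standard Gaussian, the unitary matrix as a random orthogonal matrix, and the label noise as independent flips with probability eta (the simplest admissible special case noted above); the names n, d, mu and eta are ours.

```python
# One admissible instance of the generative model of Section 2 (our choices):
# Gaussian latent vector, random rotation, shift by +/- mu, independent flips.
import numpy as np

def sample_noisy(n, d, mu, eta, rng):
    y_clean = rng.choice([-1.0, 1.0], size=n)           # fair-coin clean label
    z = rng.normal(size=(n, d))                          # product of sub-Gaussians
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))         # an arbitrary rotation
    x = z @ Q.T + np.outer(y_clean, mu)                  # conditional mean is y * mu
    y_noisy = np.where(rng.random(n) < eta, -y_clean, y_clean)
    return x, y_noisy, y_clean

rng = np.random.default_rng(0)
d = 500
mu = np.zeros(d)
mu[:20] = 1.0                                            # a rare-weak style mean
X, y, y_clean = sample_noisy(n=50, d=d, mu=mu, eta=0.1, rng=rng)
print(X.shape, float(np.mean(y != y_clean)))             # fraction of flipped labels
```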
Let $(x_1, y_1), \dots, (x_n, y_n)$ be training examples drawn independently according to the noisy distribution described above.
When this training set is linearly separable, let $\hat{w}$ minimize $\|w\|_2$ subject to $y_i \langle w, x_i \rangle \ge 1$ for all $i \in \{1, \dots, n\}$. (In the setting that we analyze, we will show that, with high probability, the training set is linearly separable. When it is not, $\hat{w}$ may be chosen arbitrarily.)
We will provide bounds on the misclassification probability of the classifier parameterized by $\hat{w}$ that can be achieved with probability $1 - \delta$ over the draw of the samples.
We make the following assumptions on the parameters of the problem:
- (A.1) the failure probability satisfies ,
- (A.2) the number of samples satisfies ,
- (A.3) the dimension satisfies ,
- (A.4) the norm of the mean satisfies .
Here are some examples of generative models that fall within our framework.
Example 1 (Gaussian class-conditional model).
The clean labels are drawn by flipping a fair coin. The distribution on , after conditioning on the value of , is , for with and (here is the matrix operator norm).
Example 2 (Noisy rare-weak model).
A special case of the model described above is when and the mean vector is such that only components are non-zero and all non-zero entries are equal to .
[DJ08] studied this model in the noise-free case (i.e. where ).
Example 3 (Boolean noisy rare-weak model).
Our assumptions are also satisfied by the following setting with Boolean attributes. (Strictly speaking, the attributes need to be scaled down slightly to make the sub-Gaussian norm small enough for this to be true, but this does not affect the accuracy of the maximum margin classifier.)
- is generated first, by flipping a fair coin.
- For , the components of are conditionally independent given : are equal to with probability , are equal to with probability .
- is obtained from by flipping it with probability .
The noiseless setting of this model was studied by [HL12].
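A sampler for this Boolean setting might look as follows; this is our own sketch, with parameter names s (number of relevant attributes), p (probability that a relevant attribute agrees with the clean label) and eta (label-flip probability) chosen by us, and the values in the last line are assumptions for illustration.

```python
# Sampling from the Boolean noisy rare-weak model of Example 3 (our notation).
import numpy as np

def boolean_rare_weak(n, d, s, p, eta, rng):
    y_clean = rng.choice([-1, 1], size=n)
    x = rng.choice([-1.0, 1.0], size=(n, d))              # irrelevant coords: fair coins
    agree = rng.random((n, s)) < p                         # relevant coords agree w.p. p
    x[:, :s] = np.where(agree, y_clean[:, None], -y_clean[:, None])
    y = np.where(rng.random(n) < eta, -y_clean, y_clean)   # noisy label
    return x, y, y_clean

rng = np.random.default_rng(0)
X, y, y_clean = boolean_rare_weak(n=50, d=1000, s=40, p=0.7, eta=0.1, rng=rng)
```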
3 Main Result and Its Consequences
Our main result is a finite-sample bound on the misclassification error of the maximum margin classifier.
Theorem 4.
For all , there is an absolute constant such that, under the assumptions of Section 2, for all large enough , with probability , training on produces a maximum margin classifier satisfying
Consider the scenario where the number of samples is a constant, but where the number of dimensions and the norm of the mean are growing. Then our assumptions require the norm of the mean to grow at a certain rate with the dimension, and for the misclassification error to decrease we need a slightly faster rate of growth. When that faster rate holds, the misclassification error asymptotically approaches the noise level. (The definitions of big-Oh notation may be found in [Cor+09].)
Here are the implications of our results in the noisy rare-weak model. Recall that in this model the mean vector $\mu$ is non-zero only on $s$ coordinates, and the non-zero coordinates of $\mu$ are all equal to some $\gamma > 0$. Therefore, $\|\mu\|_2 = \gamma \sqrt{s}$.
Corollary 5.
There is an absolute constant such that, under the assumptions of Section 2, in the noisy rare-weak model, for any and all large enough , with probability , training on produces a maximum margin classifier satisfying
Next, let us examine the implications of our results in the Boolean noisy rare-weak model. Here, .
Corollary 6.
There is an absolute constant such that, under the assumptions of Section 2, in the Boolean noisy rare-weak model for any and all large enough , with probability , training on produces a maximum margin classifier satisfying
To gain some intuition, let us explore the scaling of the misclassification error in both of these problems under different scaling limits for the parameters.
Consider a case where, , and are constants and and grow. Our assumptions hold if . But for the misclassification error to decrease we need . So if where, then the misclassification error scales as and asymptotically approaches .
[Jin09] showed that for the noiseless rare-weak model learning is impossible when and is a constant. Our upper bounds show that, in a sense, the maximum margin classifier succeeds arbitrarily close to this threshold.
Another interesting scenario is when and are constants while both and grow as a function of the number of samples . Let and , for positive and . Our assumptions are satisfied if for large enough , while, for the misclassification error to reduce with we need . As gets larger the bound on the misclassification error scales as and gets arbitrarily close to for large enough . Informally, if the adversary fully expends its noise budget, the Bayes error rate will be at least ; this is true in particular in the case where labels are flipped with probability . In such cases, even if one could prove that the training data is likely to be separated by a large margin, the bound of Theorem 4 approaches the Bayes error rate faster than the standard margin bounds [Vap95, Sha+98].
4 Proof of Theorem 4
First, we may assume without loss of generality that . To see this, note that
- if is the maximum margin classifier for then is the maximum margin classifier for , and
- the probability that is the same as the probability that .
Let us assume from now on that .
Our first lemma is an immediate consequence of the coupling lemma [Lin02, Das11] that allows us to handle the noise in the samples.
Lemma 7.
There is a joint distribution on such that
- the marginal on is ,
- the marginal on is ,
- , and
- .
Definition 8.
Let be i.i.d. draws from the coupling of Lemma 7, with the redundant thrown out. Let be the set of indices of “noisy” examples, and be the indices of “clean” examples.
Note that Lemma 7 implies that are i.i.d. draws from , as before.
The next lemma is a bound on the misclassification error in terms of the expected value of the margin on clean points and the norm of the classifier.
Lemma 9.
There is an absolute positive constant such that
Proof.
In light of the previous lemma, next we prove a high probability lower bound on the expected margin on a clean point, .
Lemma 10.
For all , there is an absolute positive constant such that, for all large enough , with probability over the random choice of , it is linearly separable, and the maximum margin weight vector satisfies,
Given these two main lemmas above, the main theorem follows immediately.
Proof.
It remains to prove Lemma 10, a lower bound on the expected margin on clean points. This crucial lemma is proved through a series of auxiliary lemmas, which use a characterization of the maximum margin classifier in terms of iterates of gradient descent on the exponential loss. (We could also work with the logistic loss here, but the proofs are simpler if we work with the exponential loss, without changing the conclusions.) Denote the risk associated with the exponential loss as
$$R(w) \;=\; \frac{1}{n} \sum_{i=1}^{n} \exp\bigl(-y_i \langle w, x_i \rangle\bigr).$$
Then the iterates of gradient descent are defined as follows:
- $w_0 = 0$, and
- $w_{t+1} = w_t - \alpha \nabla R(w_t)$,
where $\alpha > 0$ is a constant step-size.
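For concreteness, these objects can be transcribed directly into code as follows (our own notation: X holds one example per row, y takes values in {-1, +1}, and `step` is the constant step-size).

```python
# The exponential risk and its gradient descent iterates, in our notation.
import numpy as np

def exp_risk(w, X, y):
    return np.mean(np.exp(-y * (X @ w)))

def exp_risk_grad(w, X, y):
    losses = np.exp(-y * (X @ w))               # per-example exponential losses
    return -(X.T @ (y * losses)) / len(y)

def gd_iterates(X, y, step, num_steps):
    w = np.zeros(X.shape[1])                     # initialized at the origin
    iterates = [w]
    for _ in range(num_steps):
        w = w - step * exp_risk_grad(w, X, y)
        iterates.append(w)
    return iterates
```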
Lemma 11 (Soudry et al., 2018, Theorem 3).
For any linearly separable training set and for all small enough step-sizes $\alpha$, we have $\lim_{t \to \infty} \frac{w_t}{\|w_t\|_2} = \frac{\hat{w}}{\|\hat{w}\|_2}$.
Definition 12.
For each index of an example, let .
Most of the argument required to prove Lemma 9 is deterministic apart from some standard concentration arguments, which are gathered in the following lemma. (Recall that, since we are in the process of proving Theorem 4, the assumptions of Section 2 are in scope.)
Lemma 13.
For all , there is a such that, for all , for all large enough , with probability over the draw of the samples the following events simultaneously occur:
(2)–(6): [high-probability concentration bounds on the norms of the examples, on the pairwise inner products between distinct examples, on the inner products between the examples and the mean, and on the number of noisy examples; these are stated and proved as separate lemmas in Appendix A], and
(7): The samples are linearly separable.
The proof of this lemma is in Appendix A.
From here on, we will assume that satisfies all the conditions shown to hold with high probability in Lemma 13.
A concern is that, late in training, noisy examples will have an outsized effect on the classifier learned. Lemma 14 below limits the extent to which this can be true. It shows that throughout the training process the loss on any one example is at most a constant factor larger than the loss on any other example. This is sufficient since the gradient of the exponential loss,
$$\nabla R(w) \;=\; -\frac{1}{n} \sum_{i=1}^{n} y_i x_i \exp\bigl(-y_i \langle w, x_i \rangle\bigr),$$
is, up to scaling, the sum of the vectors $y_i x_i$ weighted by their losses. We also know that, with high probability, the norms of the examples are all of comparable size; therefore, showing that the loss on a sample is within a constant factor of the loss on any other sample controls the influence that any one point can have on the learning process. We formalize this intuition in the proof of Lemma 10 in the sequel.
As will be clear in the proof of Lemma 14, the high dimensionality of the classifier (the dimension being much larger than the number of samples) is crucial in showing that the ratio of the losses between any pair of points is bounded. Here is some rough intuition for why this is the case.
For the sake of intuition, consider the extreme scenario where all the vectors are mutually orthogonal and have equal norms. In this case, the change in the loss on a sample due to each gradient descent update is unaffected by the other samples, and all the losses decrease at exactly the same rate. Lemma 13 implies that, when the dimension is large enough relative to the number of samples, the vectors are nearly pairwise orthogonal. In this case, the losses remain within a constant factor of one another.
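The following small experiment checks this intuition numerically. It is our own illustration, and the dimensions, noise level, number of training steps and step size are assumed values: with the dimension much larger than the number of samples, the pairwise cosines between examples are small, and the ratio between the largest and smallest per-example exponential loss stays modest throughout training, in the spirit of Lemma 14.

```python
# Near-orthogonality of high-dimensional examples and the resulting bounded
# ratio of per-example exponential losses during gradient descent (ours).
import numpy as np

rng = np.random.default_rng(1)
n, d, eta = 30, 2000, 0.1
y = rng.choice([-1.0, 1.0], size=n)
mu = np.zeros(d)
mu[:20] = 0.5
X = rng.normal(size=(n, d)) + np.outer(y, mu)
y[: int(eta * n)] *= -1                        # a few mislabeled examples

G = X @ X.T                                    # pairwise cosines are O(1/sqrt(d))
norms = np.sqrt(np.diag(G))
cosines = G / np.outer(norms, norms) - np.eye(n)
print("largest |cosine| between distinct examples:", round(np.abs(cosines).max(), 3))

w = np.zeros(d)
step = 1.0 / (2 * d)                           # small relative to the smoothness
worst_ratio = 1.0
for _ in range(50_000):
    losses = np.exp(-y * (X @ w))
    w -= step * (-(X.T @ (y * losses)) / n)    # gradient step on the exponential risk
    worst_ratio = max(worst_ratio, losses.max() / losses.min())
print("largest loss ratio seen during training:", round(worst_ratio, 2))
```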
Lemma 14.
There is an absolute constant such that, for all large enough , and all small enough step sizes , for all iterations , the ratio between the (exponential) losses of any two examples is bounded above by that constant.
Proof.
is the maximum ratio between a pair of samples at iteration . Let be the constant from Lemma 13. We will prove that for all by using an inductive argument over the iterations .
Let us begin by establishing this for the base case, the iteration $t = 0$. Since the gradient descent algorithm is initialized at the origin, the loss on every sample is $\exp(0) = 1$. Therefore, the maximum ratio between the losses of any pair of samples is $1$ at this iteration.
Assume that the inductive hypothesis holds at some iteration; we shall now prove that it must then also hold at the next iteration.
To simplify notation we shall analyze the ratio between the losses on the first and the second sample but a similar analysis holds for any distinct pair. Let be the loss on sample and let be the loss on sample at the iteration. Define to be the ratio of the losses at iteration .
By the definition of as the gradient descent iterate
Recalling that is the constant from Lemma 13, by (2), we have
and (20) gives
These, combined with the expression for above, give
which implies
(8) |
Consider two disjoint cases.
Case 1 (): Using inequality (4)
where follows since the sum of the losses on all samples is always smaller than the initial loss which is (see Lemma 25) and , while, follows as the step-size may be chosen to be at most .
Case 2 () : Reusing inequality (4),
Since and , and noting that Lemma 13 is consistent with being arbitrarily large while remains fixed, we have that, in this case, (as the term in the exponent is non-positive). This completes the proof of the inductive step in this case, and therefore the entire proof.
∎
4.1 Proof of Lemma 10
Let us proceed assuming that the event defined in Lemma 13 occurs, and, in this proof, let be the constant from that lemma. We know that this event occurs with probability at least .
We have
Dividing the sum into the clean and noisy examples, we have
(9) | ||||
Combining (4), (5) and (9) we infer
(10) |
Since , where is an arbitrarily small constant, if is the constant from Lemma 14, we have
since . Thus inequality (10) implies
Unrolling this via an induction yields
Now let us multiply both sides by
Next, let us take the large- limit. Applying Lemma 11 to the left hand side,
(11) |
By definition of the gradient descent iterates
This together with inequality (11) yields
completing the proof.
5 Simulations
We experimentally study the behavior of the maximum margin classifier in the overparameterized regime on synthetic data generated according to the Boolean noisy rare-weak model. Recall that in this model the clean label is first generated by flipping a fair coin; then the covariate is drawn, conditioned on the clean label, so that the relevant coordinates agree with the clean label with a fixed probability greater than one half, while the remaining coordinates are uniformly random and independent of the true label. The noisy label is obtained by flipping the clean label with probability . In this section the flipping probability is always . In all our experiments the number of samples is kept constant at .
In the first experiment in Figure 2 we hold and the number of relevant attributes constant and vary the dimension for different values of . We find that after an initial dip in the test error (for ) the test error starts to rise slowly with , as in our upper bounds.
Next, in Figure 3 we explore the scaling of the test error with the number of relevant attributes when and are held constant. As we would expect, the test error decreases as grows for all the different values of .
Finally, in Figure 4 we study how the test error changes when both and are increasing while and are held constant. Our results (see Corollary 6) do not guarantee learning when (and [Jin09] proved that learning is impossible in a related setting, even in the absence of noise); we find that the test error remains constant in our experiment in this setting. In the cases when and when , slightly beyond this threshold, the test error approaches the Bayes-optimal error as gets large in our experiment. This provides experimental evidence that the maximum margin algorithm, without explicit regularization or feature selection, even in the presence of noise, learns using a number of relevant variables near the theoretical limit of what is possible for any algorithm (cf. [HL12]).
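For readers who want to run experiments in this style, here is a self-contained sketch. It is ours, not the authors' experimental code: the hard-margin classifier is approximated by a soft-margin linear SVM with a very large cost parameter, the error is measured against clean test labels, and all parameter values are assumptions chosen for illustration.

```python
# Approximate maximum margin classification on Boolean noisy rare-weak data,
# with the test error (against clean labels) estimated as the dimension grows.
import numpy as np
from sklearn.svm import SVC

def sample(n, d, s, p, eta, rng):
    y_clean = rng.choice([-1, 1], size=n)
    x = rng.choice([-1.0, 1.0], size=(n, d))
    agree = rng.random((n, s)) < p                         # relevant coords agree w.p. p
    x[:, :s] = np.where(agree, y_clean[:, None], -y_clean[:, None])
    y = np.where(rng.random(n) < eta, -y_clean, y_clean)   # label noise
    return x, y, y_clean

rng = np.random.default_rng(0)
n, s, p, eta = 100, 50, 0.6, 0.1
for d in [200, 1000, 5000]:
    Xtr, ytr, _ = sample(n, d, s, p, eta, rng)
    Xte, _, yte_clean = sample(2000, d, s, p, eta, rng)
    clf = SVC(kernel="linear", C=1e6).fit(Xtr, ytr)        # ~ hard-margin SVM
    err = np.mean(clf.predict(Xte) != yte_clean)
    # The population risk under the noisy distribution is roughly eta larger.
    print(f"d={d:5d}   error against clean test labels: {err:.3f}")
```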



6 Additional Related Work
[NJ02] compared the Naive Bayes algorithm, which builds a classifier from estimates of class-conditional distributions using conditional independence assumptions, with discriminative training of a logistic regressor. Their main point is that Naive Bayes converges faster. Our analysis provides a counterpoint to theirs, showing that, for a reasonable data distribution that includes label noise, in the overparameterized regime, unregularized discriminative training with a commonly used loss function learns a highly accurate classifier from a constant number of examples.
Learning with class-conditional Gaussians sharing the same covariance structure has been studied extensively in the case where the number of training examples is greater than the number of parameters [see, e.g., And03]. When , [BL04] showed that, even when the class-conditional distributions do not have diagonal covariance matrices, behaving as if they do can lead to improved classification accuracy. The model that we use is a generalization of the rare-weak model studied by [DJ08] (see Section 2 for details of our set-up). The class-conditional distributions studied there have a standard multivariate normal distribution, while our results hold for a more general class of sub-Gaussian class-conditional distributions. More importantly, in order to address the experimental findings of [Zha+17], we have also supplemented the rare-weak model to include label noise. Finite-sample bounds for algorithms using penalties, again in the absence of label noise, have been obtained [CL11, LPR15, LJ17, CZ19]. [DW18] studied regularized classification in the asymptotic framework where and go to infinity together. [FF08] and [Jin09] proved that learning with class-conditional Gaussians is impossible when too few variables are associated with the class designations. Our analysis shows that, even in the presence of misclassification noise, in a sense, the maximum-margin algorithm succeeds up to the edge of the frontier established by one of the results of [Jin09]. [NK19] used a data distribution like the distributions considered here, but to analyze limitations of uniform convergence tools.
The framework studied here also includes as a special case the setting studied by [HL12], with Boolean attributes; again, a key modification is the addition of misclassification noise. Also, while the upper bounds of [HL12] are for algorithms that perform unweighted votes over selected attributes, here we consider the maximum margin algorithm. A more refined analysis of learning with conditionally independent Boolean attributes was carried out by [BK15]. [KA18] studied learning with conditionally independent Boolean attributes in the presence of noise—they analyzed tasks other than classification, including estimating the degree of association between the attributes (viewed in that work as experts) and the true class designations.
As mentioned above, we consider the case that the data is corrupted with label noise. We consider adversarial label noise [KSS94, Kal+08, KLS09, ABL17, Tal20]. In this model, an adversary is allowed to change the classifications of an arbitrary subset of the domain whose probability is , while leaving the marginal distribution on the covariates unchanged. It includes as a special case the heavily studied situation in which classifications are randomly flipped with probability [AL88, Kea98, Ces+99, Ser99, KS05, LS10, VMW15] along with variants that allow limited dependence of the probability that a label is corrupted on the clean example [Lug92, Nat+13, SBH13, CFS20]. Adversarial label noise allows for the possibility that noise is concentrated in a part of the domain, where noisy examples have greater potential to coordinate their effects; it is a weaker assumption than even Massart noise [MN06, BBM08, Awa+15, DGT19], which requires a separate limit on the conditional probability of an incorrect label, given any clean example. We show that, with sufficient overparameterization, even in the absence of regularity in the noise, the algorithm that simply minimizes the standard softmax loss without any explicit regularization enjoys surprisingly strong noise tolerance.
7 Discussion
Even in the presence of misclassification noise, with sufficient overparameterization, unregularized minimization of the logistic loss produces accurate classifiers when the clean data has well-separated sub-Gaussian class-conditional distributions.
We have analyzed the case of a linear classifier without a bias term. In the setting studied here, the Bayes-optimal classifier has a bias term of zero, and adding analysis of a bias term in the maximum margin classifier would complicate the analysis without significantly changing the results.
In the noisy rare-weak model, when and scale favorably with , and is a constant, the excess risk of the maximum margin algorithm decreases very rapidly with . One contributing cause is a “wisdom of the crowds” effect that is present when classifying with conditionally independent attributes: a classifier can be very accurate, even when the angle between its normal vector and the optimum is not very small. For example, if experts each predict a binary class, and they are correct independently with probability , a vote over their predictions remains over 95% accurate even if we flip the votes of of them. (Note that, even in some cases where Lemma 10 implies accuracy very close to optimal, it may not imply that the cosine of the angle between and is anywhere near .) On the other hand, the concentration required for successful generalization is robust to departures from the conditional independence assumption. Our assumptions already allow substantial class-conditional dependence among the attributes, but it may be interesting to explore even weaker assumptions.
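To make the “wisdom of the crowds” effect concrete, the short computation below evaluates the accuracy of a majority vote of conditionally independent weak predictors when an adversary flips some of the votes. It is our own illustration; the number of predictors, their individual accuracy, and the number of flipped votes are assumed values, not the specific figures referred to above.

```python
# Accuracy of a majority vote of m conditionally independent predictors, each
# correct with probability p, when an adversary flips k votes in the worst case.
from math import comb

def vote_accuracy(m, p, k=0):
    # Worst case: the adversary flips k correct votes to incorrect ones, so the
    # vote is still correct whenever the number of originally correct predictors
    # C satisfies C - k > m - (C - k), i.e. C > m / 2 + k.
    threshold = m / 2 + k
    return sum(comb(m, c) * p**c * (1 - p)**(m - c)
               for c in range(m + 1) if c > threshold)

# Assumed illustration: 1000 weak predictors, each correct with probability 0.6.
print(f"no flipped votes : {vote_accuracy(1000, 0.6):.4f}")
print(f"50 flipped votes : {vote_accuracy(1000, 0.6, k=50):.4f}")
```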
We note that a bound on the accuracy of the maximum margin classifier with respect to the distribution without any label noise is implicit in our analysis. (The bound is the same as in Theorem 4, but without the additive noise-level term.)
Our bounds show that the maximum margin classifier approaches the Bayes risk as the parameters go to infinity in various ways. It would be interesting to characterize the conditions under which this happens. A related question is to prove lower bounds in terms of the parameters of the problem. Another is to prove bounds for finite and under weaker conditions.
We assume that the distributions of and are the same. This is useful in particular for simplifying the analysis of dot products between examples of opposite classes. It should not be difficult to extend the analysis meaningfully to remove this assumption—we use this assumption in this paper to keep the analysis as simple and clean as possible.
The distribution of comes from a sub-Gaussian distribution obtained by applying a unitary transformation to a latent variable with a product distribution. Concentration theorems have been proved under many conditions weaker than independence [see, e.g., SSS95, DR98, Pem01]. Our analysis can be straightforwardly extended using these weaker assumptions.
The assumption that the latent variable satisfies formalizes the notion that, in a sense, these variables are truly used, which is required for concentration. We believe that smaller values of are compatible with successful learning. We chose this assumption to facilitate a simple and clean analysis, but an analysis that separates the dependence on is a potential subject for future work.
The lower bounds on are needed for concentration, as described earlier. We suspect that the requirement that can be improved. The bottleneck is in the proof of Lemma 14. (As we mentioned earlier, a larger value of promotes the property that the loss on an example can be reduced by gradient descent without increasing the loss on other examples very much.)
The lower bound on allows us to focus on the case where most clean examples are classified correctly by a large margin, which is the case that we want to focus on. This could potentially be weakened or removed through a case analysis, exploiting the fact that a weaker bound is needed in the case that is small.
Using the generalized Hoeffding bound (Theorem 16) it is not hard to show [see HL12] that, in our setting, the Bayes optimal classifier has error at most for an absolute constant , and Slud’s Lemma gives a similar lower bound [see AB09; see also HL12]. Our upper bound of for the maximum margin algorithm applied to finite training data is worse than this by a factor of in the exponent.
Implicit regularization lemmas like the one that was so helpful to us have been obtained for other problems [see, e.g., Gun+17, Gun+18a, Woo+20, Azu+21]. We hope that further advances in implicit regularization research can be combined with the techniques of this paper to prove generalization guarantees for interpolating classifiers using richer model classes, including neural networks.
Acknowledgements
We thank anonymous reviewers for their valuable comments and suggestions. NC gratefully acknowledges the support of the NSF through grants IIS-1619362 and IIS-1909365.
Appendix A Concentration Inequalities
In this section we begin by presenting a definition of sub-Gaussian and sub-exponential random variables in terms of Orlicz norms. Then we state a version of Hoeffding’s inequality and a version of Bernstein’s inequality. Finally, we prove Lemma 13 which implies that a good event which our proofs rely on holds with high probability.
For an excellent reference on sub-Gaussian and sub-exponential concentration inequalities, we refer the reader to [Ver18, Chapter 2].
Definition 15 (sub-Gaussian random variable).
A random variable $X$ is sub-Gaussian if, for some $t > 0$,
$$\mathbb{E}\exp\bigl(X^2 / t^2\bigr)$$
is bounded. Further, $\|X\|_{\psi_2} := \inf\{ t > 0 : \mathbb{E}\exp(X^2/t^2) \le 2 \}$ is defined to be its sub-Gaussian norm.
We now state the general Hoeffding inequality [see Ver18], a concentration inequality for a weighted sum of independent sub-Gaussian random variables.
Theorem 16 (General Hoeffding’s inequality).
Let $X_1, \dots, X_N$ be independent mean-zero sub-Gaussian random variables and $a = (a_1, \dots, a_N) \in \mathbb{R}^N$. Then, for every $t \ge 0$, we have
$$\Pr\left( \left| \sum_{i=1}^{N} a_i X_i \right| \ge t \right) \le 2 \exp\left( -\frac{c\, t^2}{K^2 \|a\|_2^2} \right),$$
where $K = \max_i \|X_i\|_{\psi_2}$ and $c > 0$ is an absolute constant.
A one-sided version of this theorem (an upper or lower deviation bound) holds without the factor of 2 multiplying the exponential on the right-hand side.
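As a quick numerical sanity check of the shape of this bound (our own check, with Rademacher variables standing in for generic sub-Gaussian variables, and the weights and thresholds chosen arbitrarily):

```python
# Monte Carlo check: tails of a weighted sum of independent Rademacher variables
# fall below the Hoeffding-type curve 2 * exp(-t^2 / 2) at deviation t * ||a||_2.
import numpy as np

rng = np.random.default_rng(0)
N, trials = 50, 200_000
a = rng.uniform(0.5, 1.5, size=N)                      # fixed weight vector
S = (rng.choice([-1.0, 1.0], size=(trials, N)) * a).sum(axis=1)

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(S) >= t * np.linalg.norm(a))
    bound = 2 * np.exp(-t**2 / 2)                      # Hoeffding bound for this case
    print(f"t={t}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```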
Definition 17 (sub-exponential random variable).
A random variable $X$ is said to be sub-exponential if, for some $t > 0$,
$$\mathbb{E}\exp\bigl(|X| / t\bigr)$$
is bounded. Further, $\|X\|_{\psi_1} := \inf\{ t > 0 : \mathbb{E}\exp(|X|/t) \le 2 \}$ is defined to be its sub-exponential norm.
We shall also use Bernstein’s inequality [see Ver18], a concentration inequality for a sum of independent sub-exponential random variables.
Theorem 18 (Bernstein’s inequality).
For independent mean-zero sub-exponential random variables $X_1, \dots, X_N$ and every $t \ge 0$, we have
$$\Pr\left( \left| \sum_{i=1}^{N} X_i \right| \ge t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{\sum_{i=1}^{N} \|X_i\|_{\psi_1}^2}, \frac{t}{\max_i \|X_i\|_{\psi_1}} \right) \right),$$
where $c > 0$ is an absolute constant.
Again, note that a one-sided version of this inequality holds without the factor of 2 multiplying the exponential on the right-hand side.
We break the proof of Lemma 13 into different parts, which are proved in separate lemmas. Lemma 13 then follows by a union bound.
Lemma 19.
For all , there is a such that, for all large enough , with probability at least , for all ,
Proof.
For any clean sample , the random variables are sub-exponential with norm
After centering, the sub-exponential norm of the zero-mean random variable is at most a constant [see Ver18]. Therefore, by Bernstein’s inequality, there is an absolute constant such that
By setting
since for a large constant . Recall that by assumption, , where the upper bound follows from the assumption that the components of have sub-Gaussian norm at most . Recalling that ,
(12) |
with probability at least .
By Young’s inequality for products, . Also recall that by assumption . Combining this with the left hand side in the display above, for large enough , we have
Again by Young’s inequality . Therefore,
with the same probability.
A similar argument also holds for all noisy samples by considering the random variables . Hence, taking a union bound over all samples completes the proof. ∎
Lemma 20.
There is a such that, for all large enough , with probability at least , for all ,
Proof.
First, let us condition on the division of into clean points and noisy points . After this, for each , (where the expectation is conditioned on being clean), and for each , . For each , let . The same logic that proved (12), together with a union bound, yields
(13) |
For any pair of indices of examples, we have
(14) |
If we regard as fixed, and only as random, Theorem 16 gives
Thus,
for . Substituting into inequality (14), we infer
Taking a union bound over all pairs for the first term, and all individuals for the second term, we get
Choosing for a large enough value of , we have
Applying (13), we get
(15) |
For any , clean or noisy, Hoeffding’s inequality implies
Since , this implies
Therefore, by taking a union bound over all
(16) |
Both the events in (15) and (16) will simultaneously hold with probability at most . Assume that the event complementary to this bad event occurs, then for any distinct pair and
which completes our proof. ∎
Lemma 21.
For all large enough , with probability at least ,
Proof.
If is a clean point then, . Therefore the random variable has sub-Gaussian norm at most . By applying Hoeffding’s inequality,
since . Taking a union bound over all clean points establishes the claim. ∎
Lemma 22.
For all large enough , with probability at least ,
Proof.
The proof is the same as the proof of Lemma 21, except that for any noisy sample, , the conditional mean . ∎
Lemma 23.
For all , for all large enough , with probability the number of noisy samples satisfies .
Proof.
Since , this follows from a Hoeffding bound. ∎
Proof.
Let . For each and any ,
for , completing the proof. ∎
Appendix B Decreasing Loss
Lemma 25.
For all small enough step sizes , for all iterations $t$, $R(w_t) \le R(w_0) = 1$.
Proof.
Since $R(w_0) = 1$, it suffices to prove that, for all $t$, $R(w_{t+1}) \le R(w_t)$. Toward showing this, note that, if is the constant from Lemma 13, the operator norm of the Hessian of $R$ at any relevant point may be bounded as follows:
This implies that $R$ is smooth, with the corresponding constant, over the relevant region. This in turn implies that, for a small enough step size, a gradient step does not increase the risk (this can be seen, for example, by mirroring the argument used in Lemma B.2 of [JT19]). The lemma then follows using induction. ∎
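For reference, the standard descent step being invoked here is the following textbook inequality, written in our own notation: if $R$ is $\beta$-smooth along the segment between consecutive iterates and the step size satisfies $\alpha \le 2/\beta$, then a gradient step cannot increase the risk,
$$R(w_{t+1}) \;=\; R\bigl(w_t - \alpha \nabla R(w_t)\bigr) \;\le\; R(w_t) - \alpha\Bigl(1 - \frac{\beta\alpha}{2}\Bigr)\,\|\nabla R(w_t)\|_2^2 \;\le\; R(w_t).$$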
References
- [And03] T.W. Anderson “An introduction to multivariate statistical analysis”, Wiley Series in Probability and Statistics Wiley, 2003
- [AL88] Dana Angluin and Philip Laird “Learning from noisy examples” In Machine Learning 2.4, 1988, pp. 343–370
- [AB09] Martin Anthony and Peter L Bartlett “Neural network learning: Theoretical foundations” Cambridge University Press, 2009
- [ABL17] Pranjal Awasthi, Maria Florina Balcan and Philip M Long “The power of localization for efficiently learning linear separators with noise” In Journal of the ACM (JACM) 63.6, 2017, pp. 1–27
- [Awa+15] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab and Ruth Urner “Efficient learning of linear separators under bounded noise” In Conference on Learning Theory, 2015, pp. 167–190
- [Azu+21] Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake Woodworth, Nathan Srebro, Amir Globerson and Daniel Soudry “On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent” In arXiv preprint arXiv:2102.09769, 2021
- [Bar+20] Peter L Bartlett, Philip M Long, Gábor Lugosi and Alexander Tsigler “Benign overfitting in linear regression” In Proceedings of the National Academy of Sciences 117.48, 2020, pp. 30063–30070
- [Bel+19] Mikhail Belkin, Daniel Hsu, Siyuan Ma and Soumik Mandal “Reconciling modern machine-learning practice and the classical bias–variance trade-off” In Proceedings of the National Academy of Sciences 116.32 National Academy of Sciences, 2019, pp. 15849–15854
- [BHX20] Mikhail Belkin, Daniel Hsu and Ji Xu “Two models of double descent for weak features” In SIAM Journal on Mathematics of Data Science 2.4, 2020, pp. 1167–1180
- [BHM18] Mikhail Belkin, Daniel J Hsu and Partha Mitra “Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate” In Advances in Neural Information Processing Systems, 2018, pp. 2300–2311
- [BRT19] Mikhail Belkin, Alexander Rakhlin and Alexandre B Tsybakov “Does data interpolation contradict statistical optimality?” In International Conference on Artificial Intelligence and Statistics, 2019, pp. 1611–1619
- [BK15] Daniel Berend and Aryeh Kontorovich “A finite sample analysis of the Naive Bayes classifier.” In Journal of Machine Learning Research 16.44, 2015, pp. 1519–1545
- [BL04] Peter J Bickel and Elizaveta Levina “Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations” In Bernoulli 10.6, 2004, pp. 989–1010
- [BBM08] Gilles Blanchard, Olivier Bousquet and Pascal Massart “Statistical performance of support vector machines” In The Annals of Statistics 36.2, 2008, pp. 489–531
- [CL11] Tony Cai and Weidong Liu “A direct estimation approach to sparse linear discriminant analysis” In Journal of the American statistical association 106.496, 2011, pp. 1566–1577
- [CZ19] Tony Cai and Linjun Zhang “High dimensional linear discriminant analysis: optimality, adaptive algorithm and missing data” In Journal of the Royal Statistical Society: Series B (Statistical Methodology) 81.4, 2019, pp. 675–705
- [CS20] Emmanuel J Candès and Pragya Sur “The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression” In The Annals of Statistics 48.1, 2020, pp. 27–42
- [CFS20] Timothy I Cannings, Yingying Fan and Richard J Samworth “Classification with imperfect training labels” In Biometrika 107.2, 2020, pp. 311–330
- [Ces+99] Nicolo Cesa-Bianchi, Eli Dichterman, Paul Fischer, Eli Shamir and Hans Ulrich Simon “Sample-efficient strategies for learning in the presence of noise” In Journal of the ACM (JACM) 46.5, 1999, pp. 684–719
- [CL20] Niladri S Chatterji and Philip M Long “Finite-sample analysis of interpolating linear classifiers in the overparameterized regime” In arXiv preprint arXiv:2004.12019, 2020
- [Cor+09] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest and Clifford Stein “Introduction to algorithms” MIT Press, 2009
- [Cov65] Thomas M Cover “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition” In IEEE transactions on electronic computers, 1965, pp. 326–334
- [Das11] Constantinos Daskalakis “Probability and Computation, Lecture 3” Scribes: A. Chiesa and Z. Zhu. (Downloaded 4/13/20)., http://people.csail.mit.edu/costis/6896sp11/lec3s.pdf, 2011
- [DKT20] Zeyu Deng, Abla Kammoun and Christos Thrampoulidis “A Model of Double Descent for High-Dimensional Logistic Regression” In ICASSP, 2020, pp. 4267–4271
- [DGT19] Ilias Diakonikolas, Themis Gouleakis and Christos Tzamos “Distribution-independent PAC learning of halfspaces with Massart noise” In Advances in Neural Information Processing Systems, 2019, pp. 4749–4760
- [DW18] Edgar Dobriban and Stefan Wager “High-dimensional asymptotics of prediction: Ridge regression and classification” In The Annals of Statistics 46.1, 2018, pp. 247–279
- [DJ08] David Donoho and Jiashun Jin “Higher criticism thresholding: Optimal feature selection when useful features are rare and weak” In Proceedings of the National Academy of Sciences 105.39, 2008, pp. 14790–14795
- [DR98] Devdatt Dubhashi and Desh Ranjan “Balls and Bins: A Study in Negative Dependence” In Random Structures & Algorithms 13.5, 1998, pp. 99–124
- [DHS12] Richard O Duda, Peter E Hart and David G Stork “Pattern classification” John Wiley & Sons, 2012
- [FF08] Jianqing Fan and Yingying Fan “High dimensional classification using features annealed independence rules” In The Annals of Statistics 36.6, 2008, pp. 2605
- [Gun+18] Suriya Gunasekar, Jason Lee, Daniel Soudry and Nathan Srebro “Characterizing Implicit Bias in Terms of Optimization Geometry” In International Conference on Machine Learning, 2018, pp. 1832–1841
- [Gun+18a] Suriya Gunasekar, Jason D Lee, Daniel Soudry and Nathan Srebro “Implicit bias of gradient descent on linear convolutional networks” In Advances in Neural Information Processing Systems, 2018, pp. 9461–9471
- [Gun+17] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur and Nathna Srebro “Implicit regularization in matrix factorization” In Advances in Neural Information Processing Systems, 2017, pp. 6151–6159
- [Has+19] Trevor Hastie, Andrea Montanari, Saharon Rosset and Ryan J Tibshirani “Surprises in high-dimensional ridgeless least squares interpolation” In arXiv preprint arXiv:1903.08560, 2019
- [HTF09] Trevor Hastie, Robert Tibshirani and Jerome Friedman “The elements of statistical learning: data mining, inference, and prediction” Springer Science & Business Media, 2009
- [HL12] David P Helmbold and Philip M Long “On the necessity of irrelevant variables” In Journal of Machine Learning Research 13.Jul, 2012, pp. 2145–2170
- [HMX21] Daniel Hsu, Vidya Muthukumar and Ji Xu “On the proliferation of support vectors in high dimensions” In International Conference on Artificial Intelligence and Statistics, 2021, pp. 91–99
- [JT19] Ziwei Ji and Matus Telgarsky “The implicit bias of gradient descent on nonseparable data” In Conference on Learning Theory, 2019, pp. 1772–1798
- [Jin09] Jiashun Jin “Impossibility of successful classification when useful features are rare and weak” In Proceedings of the National Academy of Sciences 106.22, 2009, pp. 8859–8864
- [Kal+08] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour and Rocco A Servedio “Agnostically learning halfspaces” In SIAM Journal on Computing 37.6, 2008, pp. 1777–1805
- [KS05] Adam Tauman Kalai and Rocco A Servedio “Boosting in the presence of noise” In Journal of Computer and System Sciences 71.3, 2005, pp. 266–290
- [Kea98] Michael Kearns “Efficient noise-tolerant learning from statistical queries” In Journal of the ACM (JACM) 45.6, 1998, pp. 983–1006
- [KSS94] Michael J Kearns, Robert E Schapire and Linda M Sellie “Toward efficient agnostic learning” In Machine Learning 17.2-3, 1994, pp. 115–141
- [KA18] Matthäus Kleindessner and Pranjal Awasthi “Crowdsourcing with arbitrary adversaries” In International Conference on Machine Learning, 2018, pp. 2708–2717
- [KLS09] Adam R Klivans, Philip M Long and Rocco A Servedio “Learning halfspaces with malicious noise” In Journal of Machine Learning Research 10.Dec, 2009, pp. 2715–2740
- [KLS20] Dmitry Kobak, Jonathan Lomond and Benoit Sanchez “The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization” In Journal of Machine Learning Research 21.169, 2020, pp. 1–16
- [LPR15] Tianyang Li, Adarsh Prasad and Pradeep K Ravikumar “Fast classification rates for high-dimensional Gaussian generative models” In Advances in Neural Information Processing Systems, 2015, pp. 1054–1062
- [LJ17] Yanfang Li and Jinzhu Jia “L1 least squares for sparse high-dimensional LDA” In Electronic Journal of Statistics 11.1, 2017, pp. 2499–2518
- [LR20] Tengyuan Liang and Alexander Rakhlin “Just interpolate: Kernel “ridgeless” regression can generalize” In The Annals of Statistics 48.3, 2020, pp. 1329–1347
- [LRZ20] Tengyuan Liang, Alexander Rakhlin and Xiyu Zhai “On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels” In Conference on Learning Theory, 2020, pp. 2683–2711
- [LR21] Tengyuan Liang and Benjamin Recht “Interpolating Classifiers Make Few Mistakes” In arXiv preprint arXiv:2101.11815, 2021
- [LS20] Tengyuan Liang and Pragya Sur “A precise high-dimensional asymptotic theory for boosting and min--norm interpolated classifiers” In arXiv preprint arXiv:2002.01586, 2020
- [Lin02] Torgny Lindvall “Lectures on the coupling method” Courier Corporation, 2002
- [LS10] Philip M Long and Rocco A Servedio “Random classification noise defeats all convex potential boosters” In Machine learning 78.3, 2010, pp. 287–304
- [Lug92] Gabor Lugosi “Learning with an unreliable teacher” In Pattern Recognition 25.1, 1992, pp. 79–87
- [MN06] Pascal Massart and Élodie Nédélec “Risk bounds for statistical learning” In The Annals of Statistics 34.5, 2006, pp. 2326–2366
- [MM19] Song Mei and Andrea Montanari “The generalization error of random features regression: Precise asymptotics and double descent curve” In arXiv preprint arXiv:1908.05355, 2019
- [Mon+19] Andrea Montanari, Feng Ruan, Youngtak Sohn and Jun Yan “The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime” In arXiv preprint arXiv:1911.01544, 2019
- [Mut+20] Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu and Anant Sahai “Classification vs regression in overparameterized regimes: Does the loss function matter?” In arXiv preprint arXiv:2005.08054, 2020
- [Mut+20a] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian and Anant Sahai “Harmless interpolation of noisy data in regression” In IEEE Journal on Selected Areas in Information Theory, 2020
- [NSS19] Mor Shpigel Nacson, Nathan Srebro and Daniel Soudry “Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate” In International Conference on Artificial Intelligence and Statistics, 2019, pp. 3051–3059
- [NK19] Vaishnavh Nagarajan and J Zico Kolter “Uniform convergence may be unable to explain generalization in deep learning” In Advances in Neural Information Processing Systems, 2019, pp. 11615–11626
- [Nat+13] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar and Ambuj Tewari “Learning with noisy labels” In Advances in neural information processing systems, 2013, pp. 1196–1204
- [Ney17] Behnam Neyshabur “Implicit regularization in deep learning” In arXiv preprint arXiv:1709.01953, 2017
- [NTS14] Behnam Neyshabur, Ryota Tomioka and Nathan Srebro “In search of the real inductive bias: On the role of implicit regularization in deep learning” In arXiv preprint arXiv:1412.6614, 2014
- [NJ02] Andrew Y Ng and Michael I Jordan “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes” In Advances in Neural Information Processing Systems, 2002, pp. 841–848
- [Pem01] Sriram V Pemmaraju “Equitable coloring extends Chernoff-Hoeffding bounds” In Approximation, Randomization, and Combinatorial Optimization: Algorithms and Techniques Springer, 2001, pp. 285–296
- [SSS95] Jeanette P Schmidt, Alan Siegel and Aravind Srinivasan “Chernoff–Hoeffding bounds for applications with limited independence” In SIAM Journal on Discrete Mathematics 8.2 SIAM, 1995, pp. 223–250
- [SBH13] Clayton Scott, Gilles Blanchard and Gregory Handy “Classification with asymmetric label noise: Consistency and maximal denoising” In Conference on Learning Theory, 2013, pp. 489–511
- [Ser99] Rocco A Servedio “On PAC learning using Winnow, Perceptron, and a Perceptron-like algorithm” In Conference on Computational Learning Theory, 1999, pp. 296–307
- [Sha+98] John Shawe-Taylor, Peter L Bartlett, Robert C Williamson and Martin Anthony “Structural risk minimization over data-dependent hierarchies” In IEEE transactions on Information Theory 44.5, 1998, pp. 1926–1940
- [Sou+18] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar and Nathan Srebro “The implicit bias of gradient descent on separable data” In Journal of Machine Learning Research 19.1, 2018, pp. 2822–2878
- [SC19] Pragya Sur and Emmanuel J Candès “A modern maximum-likelihood theory for high-dimensional logistic regression” In Proceedings of the National Academy of Sciences 116.29, 2019, pp. 14516–14525
- [Tal20] Kunal Talwar “On the Error Resistance of Hinge-Loss Minimization” In Advances in Neural Information Processing Systems 33, 2020, pp. 4223–4234
- [VMW15] Brendan Van Rooyen, Aditya Menon and Robert C Williamson “Learning with symmetric label noise: The importance of being unhinged” In Advances in Neural Information Processing Systems, 2015, pp. 10–18
- [Vap82] Vladimir Vapnik “Estimation of dependences based on empirical data” Springer-Verlag, 1982
- [Vap95] Vladimir N Vapnik “The nature of statistical learning theory” Springer-Verlag, 1995
- [Ver18] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge University Press, 2018
- [WT20] Ke Wang and Christos Thrampoulidis “Benign Overfitting in Binary Classification of Gaussian Mixtures” In arXiv preprint arXiv:2011.09148, 2020
- [Woo+20] Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry and Nathan Srebro “Kernel and rich regimes in overparametrized models” In Conference on Learning Theory, 2020, pp. 3635–3673
- [Zha+17] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals “Understanding deep learning requires rethinking generalization” In International Conference on Learning Representations, 2017