For interpolating kernel machines, minimizing the norm of the ERM solution minimizes stability
Abstract
We study the average stability of kernel ridge-less regression and derive corresponding risk bounds. We show that the interpolating solution with minimum norm minimizes a bound on stability, which in turn is controlled by the condition number of the empirical kernel matrix. The latter can be characterized in the asymptotic regime where both the dimension and cardinality of the data go to infinity. Under the assumption of random kernel matrices, the corresponding test error should be expected to follow a double descent curve.
1 Introduction
Statistical learning theory studies the learning properties of machine learning algorithms, and more fundamentally, the conditions under which learning from finite data is possible. In this context, classical learning theory focuses on the size of the hypothesis space in terms of different complexity measures, such as combinatorial dimensions, covering numbers and Rademacher/Gaussian complexities (Shalev-Shwartz & Ben-David, 2014; Boucheron et al., 2005). Another more recent approach is based on defining suitable notions of stability with respect to perturbations of the data (Bousquet & Elisseeff, 2001; Kutin & Niyogi, 2002). In this view, the continuity of the process that maps data to estimators is crucial, rather than the complexity of the hypothesis space. Different notions of stability can be considered, depending on the data perturbation and metric considered (Kutin & Niyogi, 2002). Interestingly, the stability and complexity approaches to characterizing the learnability of problems are not at odds with each other, and can be shown to be equivalent (Poggio et al., 2004; Shalev-Shwartz et al., 2010).
In modern machine learning, overparameterized models, with a larger number of parameters than the size of the training data, have become common. The ability of these models to generalize is well explained by classical statistical learning theory as long as some form of regularization is used in the training process (Bühlmann & Van De Geer, 2011; Steinwart & Christmann, 2008). However, it was recently shown, first for deep networks (Zhang et al., 2017) and more recently for kernel methods (Belkin et al., 2019), that learning is possible in the absence of regularization, i.e., when perfectly fitting/interpolating the data. Much recent work in statistical learning theory has tried to find theoretical ground for this empirical finding. Since learning using models that interpolate is not exclusive to deep neural networks, we study generalization in the presence of interpolation in the case of kernel methods. We study both linear and kernel least squares problems in this paper.
Our Contributions:
- We characterize the generalization properties of interpolating solutions for linear and kernel least squares problems using a stability approach. While the (uniform) stability properties of regularized kernel methods are well known (Bousquet & Elisseeff, 2001), we study interpolating solutions of the unregularized ("ridgeless") regression problems.
- We obtain an upper bound on the stability of interpolating solutions, and show that this upper bound is minimized by the minimum norm interpolating solution. This also means that among all interpolating solutions, the minimum norm solution has the best test error. In particular, the same conclusion is also true for gradient descent, since it converges to the minimal norm solution in the setting we consider, see e.g. Rosasco & Villa (2015).
- Our stability bounds show that the average stability of the minimum norm solution is controlled by the condition number of the empirical kernel matrix. It is well known that the numerical stability of the least squares solution is governed by the condition number of the associated kernel matrix (see Poggio et al. (2019)). Our results show that the condition number also controls stability (and hence, test error) in a statistical sense.
Organization:
In section 2, we introduce basic ideas in statistical learning and empirical risk minimization, as well as the notation used in the rest of the paper. In section 3, we briefly recall some definitions of stability. In section 4, we study the stability of interpolating solutions to kernel least squares and show that the minimum norm solutions minimize an upper bound on the stability. In section 5 we discuss our results in the context of recent work on high dimensional regression. We conclude in section 6.
2 Statistical Learning and Empirical Risk Minimization
We begin by recalling the basic ideas in statistical learning theory. In this setting, $\mathcal{X}$ is the space of features, $\mathcal{Y}$ is the space of targets or labels, and there is an unknown probability distribution $\mu$ on the product space $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. In the following, we consider $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{Y} = \mathbb{R}$. The distribution $\mu$ is fixed but unknown, and we are given a training set $S$ consisting of $n$ samples (thus $|S| = n$) drawn i.i.d. from the probability distribution on $\mathcal{Z}$, $S = (z_i)_{i=1}^n = (x_i, y_i)_{i=1}^n$. Intuitively, the goal of supervised learning is to use the training set $S$ to "learn" a function $f_S$ that, evaluated at a new value $x_{\text{new}}$, should predict the associated value of $y_{\text{new}}$, i.e. $f_S(x_{\text{new}}) \approx y_{\text{new}}$.
The loss is a function $\ell : \mathcal{F} \times \mathcal{Z} \to [0, \infty)$, where $\mathcal{F}$ is the space of measurable functions from $\mathcal{X}$ to $\mathcal{Y}$, that measures how well a function performs on a data point. We define a hypothesis space $\mathcal{H} \subseteq \mathcal{F}$ where algorithms search for solutions. With the above notation, the expected risk of $f$ is defined as $I[f] = \mathbb{E}_z[\ell(f, z)]$, which is the expected loss on a new sample drawn according to the data distribution $\mu$. In this setting, statistical learning can be seen as the problem of finding an approximate minimizer of the expected risk given a training set $S$. A classical approach to derive an approximate solution is empirical risk minimization (ERM), where we minimize the empirical risk $I_S[f] = \frac{1}{n}\sum_{i=1}^n \ell(f, z_i)$.
A natural error measure for our ERM solution $f_S$ is the expected excess risk $\mathbb{E}_S\big[I[f_S]\big] - \inf_{f \in \mathcal{H}} I[f]$. Another common error measure is the expected generalization error/gap, given by $\mathbb{E}_S\big[I[f_S] - I_S[f_S]\big]$. These two error measures are closely related, since the expected excess risk is easily bounded by the expected generalization error (see Lemma 5).
2.1 Kernel Least Squares and Minimal Norm Solution
The focus in this paper is on the kernel least squares problem. We assume the loss function is the square loss, that is, $\ell(f, z) = (y - f(x))^2$. The hypothesis space $\mathcal{H}$ is assumed to be a reproducing kernel Hilbert space, defined by a positive definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ or an associated feature map $\Phi : \mathcal{X} \to \mathcal{H}$, such that $K(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{H}}$ for all $x, x' \in \mathcal{X}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the inner product in $\mathcal{H}$. In this setting, functions are linearly parameterized, that is, there exists $w \in \mathcal{H}$ such that $f(x) = \langle w, \Phi(x) \rangle_{\mathcal{H}}$ for all $x \in \mathcal{X}$.
The ERM problem typically has multiple solutions, one of which is the minimal norm solution:
$f_S^\dagger = \operatorname*{arg\,min}\Big\{ \|f\|_{\mathcal{H}} \;:\; f \in \operatorname*{arg\,min}_{f' \in \mathcal{H}} \tfrac{1}{n}\sum_{i=1}^n \big(f'(x_i) - y_i\big)^2 \Big\} \qquad (1)
Here $\|\cdot\|_{\mathcal{H}}$ is the norm on $\mathcal{H}$ induced by the inner product. The minimal norm solution can be shown to be unique and to satisfy a representer theorem, that is, for all $x \in \mathcal{X}$:
$f_S^\dagger(x) = \sum_{j=1}^n c_j\, K(x, x_j) \qquad (2)$
where $\mathbf{c} = \mathbf{K}^\dagger\mathbf{y}$, $\mathbf{K}$ is the $n$ by $n$ matrix with entries $\mathbf{K}_{ij} = K(x_i, x_j)$, $i, j = 1, \dots, n$, $\mathbf{y} = (y_1, \dots, y_n)$, and $\mathbf{K}^\dagger$ is the Moore-Penrose pseudoinverse of $\mathbf{K}$. If we assume $n \le d$ and that we have linearly independent data features, that is, the rank of $\mathbf{K}$ is $n$, then it is possible to show that for many kernels one can replace $\mathbf{K}^\dagger$ by $\mathbf{K}^{-1}$ (see Remark 2). Note that invertibility of $\mathbf{K}$ is necessary and sufficient for interpolation. That is, if $\mathbf{K}$ is invertible, $f_S^\dagger(x_i) = y_i$ for all $i = 1, \dots, n$, in which case the training error in (1) is zero.
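A minimal numerical sketch of the minimal norm solution (2), not taken from the paper: it builds the kernel matrix for a Gaussian RBF kernel (an arbitrary choice here), computes $\mathbf{c} = \mathbf{K}^\dagger\mathbf{y}$ with the pseudoinverse, and checks that the resulting function interpolates the training data.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def min_norm_interpolant(X, y, gamma=1.0):
    """Return f(x) = sum_j c_j K(x, x_j) with c = K^+ y, the minimal norm ERM solution."""
    K = rbf_kernel(X, X, gamma)
    c = np.linalg.pinv(K) @ y               # c = K^+ y as in (2)
    return lambda X_new: rbf_kernel(X_new, X, gamma) @ c

# usage: the interpolant fits the training data (up to numerical error)
rng = np.random.default_rng(0)
X, y = rng.standard_normal((20, 5)), rng.standard_normal(20)
f = min_norm_interpolant(X, y)
print(np.max(np.abs(f(X) - y)))             # ~ 0 when K is invertible and well conditioned
```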
Remark 1 (Pseudoinverse for underdetermined linear systems)
A simple yet relevant example is given by linear functions $f(x) = \mathbf{w}^\top x$, which correspond to $\mathcal{H} = \mathbb{R}^d$ and $\Phi$ the identity map. If the rank of the data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (whose $i$-th row is $x_i^\top$) is $n$, then any interpolating solution $\mathbf{w}$ satisfies $\mathbf{w}^\top x_i = y_i$ for all $i$, and the minimal norm solution, also called the Moore-Penrose solution, is given by $\mathbf{w}^\dagger = \mathbf{X}^\dagger\mathbf{y}$, where the pseudoinverse takes the form $\mathbf{X}^\dagger = \mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1}$.
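The linear case of Remark 1 can be checked numerically; the sketch below (with arbitrary synthetic data, rows of $\mathbf{X}$ as samples) verifies that the Moore-Penrose solution interpolates and agrees with the closed form $\mathbf{X}^\top(\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{y}$ when the rank is $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 100                                   # underdetermined: d > n
X = rng.standard_normal((n, d))                  # rows are the samples x_i
y = rng.standard_normal(n)

w_pinv = np.linalg.pinv(X) @ y                   # Moore-Penrose (minimal norm) solution
w_form = X.T @ np.linalg.solve(X @ X.T, y)       # X^T (X X^T)^{-1} y, valid when rank(X) = n

print(np.allclose(X @ w_pinv, y))                # interpolates the training data
print(np.allclose(w_pinv, w_form))               # the two expressions agree
```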
Remark 2 (Invertibility of translation invariant kernels)
Translation invariant kernels are a family of kernel functions of the form $K(x, x') = k(x - x')$, where $k$ is an even function on $\mathbb{R}^d$. Translation invariant kernels are Mercer kernels (positive semidefinite) if the Fourier transform of $k$ is non-negative. For Radial Basis Function kernels, where $K(x, x')$ depends only on $\|x - x'\|$, we have the additional property, due to Theorem 2.3 of Micchelli (1986), that for distinct points $x_1, \dots, x_n$ the kernel matrix $\mathbf{K}$ is non-singular and thus invertible.
The above discussion is directly related to regularization approaches.
Remark 3 (Stability and Tikhonov regularization)
Tikhonov regularization is used to prevent potentially unstable behavior. In the above setting, it corresponds to replacing Problem (1) by the minimization of $\frac{1}{n}\sum_{i=1}^n (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2$ over $\mathcal{H}$, with $\lambda > 0$, where the corresponding unique solution is given by $f_S^\lambda(x) = \sum_{j=1}^n c_j K(x, x_j)$ with $\mathbf{c} = (\mathbf{K} + \lambda n \mathbf{I})^{-1}\mathbf{y}$. In contrast to ERM solutions, the above approach prevents interpolation. The properties of the corresponding estimator are well known. In this paper, we complement these results by focusing on the case $\lambda = 0$.
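The following sketch, assuming the common parameterization $\mathbf{c}_\lambda = (\mathbf{K} + \lambda n\mathbf{I})^{-1}\mathbf{y}$ for the Tikhonov solution and an arbitrary RBF kernel, illustrates that the regularized coefficients approach the pseudoinverse (ridgeless) coefficients as $\lambda \to 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
# Gaussian RBF kernel matrix (arbitrary choice of kernel and bandwidth)
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, -1))

c_ridgeless = np.linalg.pinv(K) @ y              # minimal norm / ridgeless coefficients
n = len(y)
for lam in [1e-1, 1e-3, 1e-6]:
    c_ridge = np.linalg.solve(K + lam * n * np.eye(n), y)   # Tikhonov coefficients
    print(lam, np.linalg.norm(c_ridge - c_ridgeless))       # shrinks as lam -> 0
```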
Finally, we end by recalling the connection between the minimal norm solution and gradient descent.
Remark 4 (Minimum norm and gradient descent)
In our setting, it is well known that both batch and stochastic gradient iterations converge exactly to the minimal norm solution when multiple solutions exist, see e.g. Rosasco & Villa (2015). Thus, a study of the properties of the minimal norm solution explains the properties of the solution to which gradient descent converges. In particular, when ERM has multiple interpolating solutions, gradient descent converges to a solution that minimizes a bound on stability, as we show next.
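A small numerical illustration of this remark (not the paper's code): gradient descent on the least squares objective, initialized at zero, keeps its iterates in the row space of the data matrix and therefore converges to the minimum norm interpolant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 30, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                                  # start at the origin
step = 1.0 / np.linalg.norm(X, 2)**2             # step size below 1/L for this quadratic
for _ in range(5000):
    w -= step * X.T @ (X @ w - y)                # gradient of (1/2)||Xw - y||^2

w_min = np.linalg.pinv(X) @ y                    # minimum norm interpolant
print(np.linalg.norm(X @ w - y))                 # training residual ~ 0
print(np.linalg.norm(w - w_min))                 # distance to the minimum norm solution ~ 0
```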
3 Error Bounds via Stability
In this section, we recall basic results relating the learning and stability properties of Empirical Risk Minimization (ERM). Throughout the paper, we assume that ERM achieves a minimum, although the extension to almost-minimizers is possible (Mukherjee et al., 2006) and important for exponential-type loss functions (Poggio, 2020). We do not assume the expected risk to achieve a minimum. Since we will be considering leave-one-out stability in this section, we look at solutions to ERM over the complete training set $S$ and the leave one out training set $S_{\setminus i}$, obtained from $S$ by removing the $i$-th point; we denote the corresponding ERM solutions by $f_S$ and $f_{S_{\setminus i}}$.
The excess risk of ERM can be easily related to its stability properties. Here, we follow the definition laid out in Mukherjee et al. (2006) and say that an algorithm is Cross-Validation leave-one-out ($\mathrm{CV}_{loo}$) stable in expectation if there exists $\beta_{CV} > 0$ such that for all $i = 1, \dots, n$,
$\mathbb{E}_S\big[\ell(f_{S_{\setminus i}}, z_i) - \ell(f_S, z_i)\big] \le \beta_{CV} \qquad (3)$
This definition is justified by the following result that bounds the excess risk of a learning algorithm by its average stability.
Lemma 5 (Excess Risk & $\mathrm{CV}_{loo}$ Stability)
For all $i = 1, \dots, n$,
$\mathbb{E}_S\big[I[f_{S_{\setminus i}}]\big] - \inf_{f \in \mathcal{H}} I[f] \;\le\; \mathbb{E}_S\big[\ell(f_{S_{\setminus i}}, z_i) - \ell(f_S, z_i)\big] \qquad (4)$
Remark 6 (Connection to uniform stability and other notions of stability)
Uniform stability, introduced by Bousquet & Elisseeff (2001), corresponds in our notation to the assumption that there exists $\beta_u > 0$ such that for all training sets $S$, all $i = 1, \dots, n$, and all $z \in \mathcal{Z}$, $|\ell(f_{S_{\setminus i}}, z) - \ell(f_S, z)| \le \beta_u$. Clearly this is a strong notion implying most other definitions of stability. We note that there are a number of different notions of stability. We refer the interested reader to Kutin & Niyogi (2002) and Mukherjee et al. (2006).
We present the proof of Lemma 5 in Appendix A.2 due to lack of space. In Appendix A, we also discuss other definitions of stability and their connections to concepts in statistical learning theory like generalization and learnability.
4 Stability of Kernel Least Squares
In this section we analyze the expected stability of interpolating solutions to the kernel least squares problem, and obtain an upper bound on their stability. We show that this upper bound on the expected stability is smallest for the minimal norm interpolating solution (1) when compared to other interpolating solutions to the kernel least squares problem.
We have a dataset $S = \{(x_i, y_i)\}_{i=1}^n$ and we want to find a mapping $f \in \mathcal{H}$ that minimizes the empirical least squares risk. Here $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) defined by a positive definite kernel $K$. All interpolating solutions are of the form $f_S(\cdot) = \sum_{j=1}^n a_j K(\cdot, x_j)$, where $\mathbf{a} = \mathbf{K}^\dagger\mathbf{y} + (\mathbf{I} - \mathbf{K}^\dagger\mathbf{K})\mathbf{v}$ for some $\mathbf{v} \in \mathbb{R}^n$. Similarly, all interpolating solutions on the leave one out dataset can be written as $f_{S_{\setminus i}}(\cdot) = \sum_{j=1}^n (a_{\setminus i})_j K(\cdot, x_j)$, where $\mathbf{a}_{\setminus i} = \mathbf{K}_{\setminus i}^\dagger\mathbf{y}_{\setminus i} + (\mathbf{I} - \mathbf{K}_{\setminus i}^\dagger\mathbf{K}_{\setminus i})\mathbf{u}$ for some $\mathbf{u} \in \mathbb{R}^n$. Here $\mathbf{K}, \mathbf{K}_{\setminus i}$ are the empirical kernel matrices on the original and leave one out datasets respectively. We note that when $\mathbf{v} = \mathbf{0}$ and $\mathbf{u} = \mathbf{0}$, we obtain the minimum norm interpolating solutions on the datasets $S$ and $S_{\setminus i}$.
Theorem 7 (Main Theorem)
Consider the kernel least squares problem with a bounded kernel and bounded outputs, that is, there exist $\kappa, M > 0$ such that
$K(x, x) \le \kappa^2, \qquad |y| \le M \qquad (5)$
almost surely. Then for any interpolating solutions $f_S, f_{S_{\setminus i}}$,
(6)
This bound is minimized when $\mathbf{v} = \mathbf{u} = \mathbf{0}$, which corresponds to the minimum norm interpolating solutions $f_S^\dagger, f_{S_{\setminus i}}^\dagger$. For the minimum norm solutions the resulting bound on $\beta_{CV}$ is controlled by the condition number of the empirical kernel matrix $\mathbf{K}$, up to absolute constants that do not depend on either $n$ or $d$.
In the above theorem, $\|\mathbf{K}\|_{\mathrm{op}}$ refers to the operator norm of the kernel matrix $\mathbf{K}$, $\|\cdot\|$ refers to the standard Euclidean norm for vectors in $\mathbb{R}^n$, and the condition number of $\mathbf{K}$ is the ratio of its largest and smallest (non-zero) singular values.
We can combine the above result with Lemma 5 to obtain the following bound on excess risk for minimum norm interpolating solutions to the kernel least squares problem:
Corollary 8
The excess risk of the minimum norm interpolating kernel least squares solution can be bounded as:
where the quantities involved are as defined previously.
Remark 9 (Underdetermined Linear Regression)
In the case of underdetermined linear regression, i.e., linear regression where the dimensionality $d$ of the data is larger than the number of samples $n$ in the training set, we can prove a version of Theorem 7 in which the empirical kernel matrix is replaced by the data matrix. Due to space constraints, we present the proof of the results in the linear regression case in Appendix B.
4.1 Key lemmas
In order to prove Theorem 7 we make use of the following lemmas to bound the stability using the norms and the difference of the solutions.
Lemma 10
Under assumption (5), for all $i = 1, \dots, n$, it holds that $\mathbb{E}_S\big[\ell(f_{S_{\setminus i}}, z_i) - \ell(f_S, z_i)\big] \le \kappa\,\mathbb{E}_S\big[\big(2M + \kappa\|f_S\|_{\mathcal{H}} + \kappa\|f_{S_{\setminus i}}\|_{\mathcal{H}}\big)\,\|f_S - f_{S_{\setminus i}}\|_{\mathcal{H}}\big]$.
Proof We begin by recalling that the square loss is locally Lipschitz, that is, for all $y, a, a' \in \mathbb{R}$,
$\big|(y - a)^2 - (y - a')^2\big| \le \big(2|y| + |a| + |a'|\big)\,|a - a'|.$
If we apply this result to $a = f(x)$ and $a' = f'(x)$ for functions $f, f'$ in an RKHS $\mathcal{H}$,
using the basic property of an RKHS that $|f(x)| \le \kappa\,\|f\|_{\mathcal{H}}$ for all $x \in \mathcal{X}$ (since $K(x, x) \le \kappa^2$), we obtain
$\big|\ell(f, z) - \ell(f', z)\big| \le \kappa\big(2|y| + \kappa\|f\|_{\mathcal{H}} + \kappa\|f'\|_{\mathcal{H}}\big)\,\|f - f'\|_{\mathcal{H}} \qquad (7)$
In particular, we can plug $f = f_{S_{\setminus i}}$, $f' = f_S$ and $z = z_i$ into the above inequality, and the almost positivity of ERM (Mukherjee et al., 2006) allows us to drop the absolute value on the left hand side. Finally, the desired result follows by taking the expectation over $S$.
Now that we have bounded the stability using the norms and the difference of the solutions, we can find a bound on the difference between the solutions to the kernel least squares problem. This is our main stability estimate.
Lemma 11
Let $f_S, f_{S_{\setminus i}}$ be any interpolating kernel least squares solutions on the full and leave one out datasets (as defined at the top of this section). Then $\|f_S\|_{\mathcal{H}}$, $\|f_{S_{\setminus i}}\|_{\mathcal{H}}$, and the bound on $\|f_S - f_{S_{\setminus i}}\|_{\mathcal{H}}$ derived in the proof are all minimized when $\mathbf{v} = \mathbf{u} = \mathbf{0}$, which corresponds to the minimum norm interpolating solutions $f_S^\dagger, f_{S_{\setminus i}}^\dagger$.
Also,
(8)
Since the minimum norm interpolating solutions minimize both the norms of the solutions and the bound on their difference (from Lemmas 10 and 11), we can put these results together to prove Theorem 7. In the following section we provide the proof of Lemma 11.
Simulation:
In order to illustrate that the minimum norm interpolating solution is the best performing interpolating solution, we ran a simple experiment on a linear regression problem. We synthetically generated data from a linear model $y = \mathbf{w}^{*\top}\mathbf{x} + \eta$, where the noise $\eta$ was drawn i.i.d. from a zero-mean distribution. The dimension $d$ of the data was larger than the number $n$ of samples in the training dataset. A held out dataset was used to compute the test mean squared error (MSE). Interpolating solutions were computed as $\mathbf{w} = \mathbf{X}^\dagger\mathbf{y} + (\mathbf{I} - \mathbf{X}^\dagger\mathbf{X})\mathbf{v}$, and the norm of $\mathbf{v}$ was varied to obtain the plot. The results are shown in Figure 1, where we can see that the training loss is 0 for all interpolants, but the test MSE increases as $\|\mathbf{v}\|$ increases, with $\mathbf{v} = \mathbf{0}$ (the minimum norm solution) having the best performance. The figure reports results averaged over multiple trials.
[Figure 1: Test MSE of interpolating solutions as a function of the norm of $\mathbf{v}$; the training loss is zero for all interpolants, and the minimum norm solution ($\mathbf{v} = \mathbf{0}$) attains the lowest test MSE.]
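A sketch of this kind of experiment (the dimensions, noise level, and number of trials used for Figure 1 are not reproduced here, so the numbers below are arbitrary): all interpolants of the form $\mathbf{w} = \mathbf{X}^\dagger\mathbf{y} + (\mathbf{I} - \mathbf{X}^\dagger\mathbf{X})\mathbf{v}$ have zero training error, while the test MSE grows with the norm of the null space component.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 1000, 100, 500                    # arbitrary sizes with d > n
w_star = rng.standard_normal(d) / np.sqrt(d)

X, X_te = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y = X @ w_star + 0.1 * rng.standard_normal(n)    # noisy training labels
y_te = X_te @ w_star                             # noiseless ground-truth test targets

w_min = np.linalg.pinv(X) @ y                    # minimum norm interpolant
P_null = np.eye(d) - np.linalg.pinv(X) @ X       # projector onto the null space of X
v = P_null @ rng.standard_normal(d)
v /= np.linalg.norm(v)

for scale in [0.0, 0.5, 1.0, 2.0, 4.0]:
    w = w_min + scale * v                        # still interpolates: X w = y
    train_mse = np.mean((X @ w - y)**2)
    test_mse = np.mean((X_te @ w - y_te)**2)
    print(f"||w||={np.linalg.norm(w):6.2f}  train={train_mse:.1e}  test={test_mse:.3f}")
```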
4.2 Proof of Lemma 11
We can write any interpolating solution to the kernel regression problem as $f_S(\cdot) = \sum_{j=1}^n a_j K(\cdot, x_j)$, where $\mathbf{a} = \mathbf{K}^\dagger\mathbf{y} + (\mathbf{I} - \mathbf{K}^\dagger\mathbf{K})\mathbf{v}$, $\mathbf{K}$ is the kernel matrix on $S$ and $\mathbf{v}$ is any vector in $\mathbb{R}^n$, i.e. $\mathbf{K}_{ij} = K(x_i, x_j)$, and $\mathbf{y}$ is the vector $(y_1, \dots, y_n)^\top$.
Similarly, the coefficient vector for the corresponding interpolating solution to the problem over the leave one out dataset is $\mathbf{a}_{\setminus i} = \mathbf{K}_{\setminus i}^\dagger\mathbf{y}_{\setminus i} + (\mathbf{I} - \mathbf{K}_{\setminus i}^\dagger\mathbf{K}_{\setminus i})\mathbf{u}$, where $\mathbf{y}_{\setminus i}$ is the vector $\mathbf{y}$ with its $i$-th entry set to zero and $\mathbf{K}_{\setminus i}$ is the kernel matrix with the $i$-th row and column set to zero, which is the kernel matrix for the leave one out training set.
We define $\mathbf{k}_i = \big(K(x_1, x_i), \dots, K(x_n, x_i)\big)^\top$, the $i$-th column of $\mathbf{K}$, and $\mathbf{e}_i$ as a one-hot column vector with all zeros apart from the $i$-th component, which is $1$. Let $\mathbf{b} = \tfrac{1}{2}K_{ii}\,\mathbf{e}_i - \mathbf{k}_i$. Then, we have:
$\mathbf{K}_{\setminus i} = \mathbf{K} + \mathbf{e}_i\,\mathbf{b}^\top + \mathbf{b}\,\mathbf{e}_i^\top \qquad (9)$
That is, we can write $\mathbf{K}_{\setminus i}$ as a rank-2 update to $\mathbf{K}$. This can be verified by simple algebra, using the fact that $K$ is a symmetric kernel. Now we are interested in bounding $\|f_S - f_{S_{\setminus i}}\|_{\mathcal{H}}$. For a function $f(\cdot) = \sum_{j=1}^n a_j K(\cdot, x_j)$ we have $\|f\|_{\mathcal{H}}^2 = \mathbf{a}^\top\mathbf{K}\,\mathbf{a}$. So we have:
(10)
Here we make use of the definitions of $\mathbf{a}$ and $\mathbf{a}_{\setminus i}$ given above. If $\mathbf{K}$ has full rank (as in Remark 2), both $\mathbf{b}$ and $\mathbf{e}_i$ lie in its column space, and the remaining hypotheses of the theorem below can be checked directly. Using Theorem 6 in (Meyer, 1973) (equivalent to formula 2.1 of Baksalary et al. (2003)) applied to the rank-one update $\mathbf{K} + \mathbf{e}_i\,\mathbf{b}^\top$, we obtain:
(11)
Above, we use the fact that the operator norm of a rank 1 matrix $\mathbf{u}\mathbf{w}^\top$ is given by $\|\mathbf{u}\|\,\|\mathbf{w}\|$ and that $\|\mathbf{e}_i\| = 1$. Also, using the corresponding formula from List 2 of Baksalary et al. (2003), we obtain an explicit expression for the pseudoinverse of the updated matrix $\mathbf{K} + \mathbf{e}_i\,\mathbf{b}^\top$.
Next, we check the range conditions for the second rank-one update, $\mathbf{b}\,\mathbf{e}_i^\top$, and find that they place us in a different case of Meyer's analysis. This means we can use Theorem 5 in (Meyer, 1973) (equivalent to formula 2.4 in (Baksalary et al., 2003)) to obtain the expression for $\mathbf{K}_{\setminus i}^\dagger$:
(12)
Also, using the corresponding formula from List 2 of Baksalary et al. (2003), we obtain the analogous operator norm bound for the second update; combining the two steps yields the estimate of Lemma 11.
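A numerical sanity check of the structure used in this proof (a sketch under the conventions above, not the paper's code): zeroing the $i$-th row and column of the kernel matrix is a rank-2 modification, and the induced change in the pseudoinverse, which drives the bound of Lemma 11, can be computed directly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))
K = np.exp(-0.1 * np.sum((X[:, None] - X[None, :])**2, -1))  # RBF kernel matrix

i = 7
K_loo = K.copy()
K_loo[i, :] = 0.0
K_loo[:, i] = 0.0                                # kernel matrix with the i-th point "removed"

print(np.linalg.matrix_rank(K_loo - K))          # the modification has rank 2
delta = np.linalg.norm(np.linalg.pinv(K_loo) - np.linalg.pinv(K), 2)
sigma_min = np.linalg.svd(K, compute_uv=False)[-1]
print(delta, 1.0 / sigma_min)                    # pseudoinverse perturbation vs 1/sigma_min(K)
```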
5 Remark and Related Work
In the previous section we obtained bounds on the stability of interpolating solutions to the kernel least squares problem. Our kernel least squares results can be compared with stability bounds for regularized ERM (see Remark 3). Regularized ERM has a strong stability guarantee in the form of a uniform stability bound, which turns out to be inversely proportional to the product of the regularization parameter and the sample size (Bousquet & Elisseeff, 2001). However, this estimate becomes vacuous as $\lambda \to 0$. In this paper, we establish a bound on average stability, and show that this bound is minimized when the minimum norm ERM solution is chosen. We study average stability since one can expect worst case scenarios in which the minimum norm is arbitrarily large. Our main finding is in fact the relationship between minimizing the norm of the ERM solution and minimizing a bound on stability.
This leads to a second observation, namely, that we can consider the limit of our risk bounds as both the sample size $n$ and the dimensionality $d$ of the data go to infinity. This is a classical setting in statistics which allows us to use results from random matrix theory (Marchenko & Pastur, 1967). In particular, for linear kernels the behavior of the smallest eigenvalue of the kernel matrix (which appears in our bounds) can be characterized in this asymptotic limit. Here the dimension of the data coincides with the number of parameters in the model. Interestingly, analogous results hold for more general kernels (El Karoui, 2010), where the asymptotics are taken with respect to the number and dimensionality of the data. These results predict a double descent curve for the condition number, as found in practice; see Figure 2.
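A small simulation in the spirit of these random matrix results (an illustrative sketch, not the experiment behind Figure 2): the condition number of a standard Gaussian data matrix peaks when the number of samples matches the dimension and improves on either side of that ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200
for n in [50, 100, 150, 190, 200, 210, 250, 400]:
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    s = np.linalg.svd(X, compute_uv=False)
    print(f"n={n:4d}  n/d={n/d:4.2f}  cond(X)={s[0]/s[-1]:10.2f}")
# The condition number blows up near n = d and improves on both sides of the
# interpolation threshold, consistent with the Marchenko-Pastur prediction.
```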
Recently, there has been a surge of interest in studying linear and kernel least squares models, since classical results focus on situations where constraints or penalties that prevent interpolation are added to the empirical risk. For example, high dimensional linear regression is considered in Mei & Montanari (2019); Hastie et al. (2019); Bartlett et al. (2019), and "ridgeless" kernel least squares is studied in Liang et al. (2019); Rakhlin & Zhai (2018) and Liang et al. (2020). While these papers study upper and lower bounds on the risk of interpolating solutions to the linear and kernel least squares problem, ours are the first to be derived using stability arguments. While it might be possible to obtain tighter excess risk bounds through a careful analysis of the minimum norm interpolant, our simple approach helps us establish a link between stability in the statistical sense and stability in the numerical sense.
Finally, we can compare our results with observations made in Poggio et al. (2019) on the condition number of random kernel matrices. The condition number of the empirical kernel matrix is known to control the numerical stability of the solution to a kernel least squares problem. Our results show that the statistical stability is also controlled by the condition number of the kernel matrix, providing a natural link between numerical and statistical stability.
[Figure 2: Double descent behavior of the condition number of the empirical kernel matrix.]
6 Conclusions
In summary, minimizing a bound on cross validation stability minimizes the expected error in both the classical and the modern regime of ERM. In the classical regime ($n > d$), stability implies generalization and consistency as $n \to \infty$. In the modern regime ($d > n$), as described in this paper, stability can account for the double descent curve in kernel interpolants (Belkin et al. (2019) showed empirical evidence for closely related interpolation methods) under appropriate distributional assumptions. The main contribution of this paper is characterizing the stability of interpolating solutions, in particular deriving excess risk bounds via a stability argument. In the process, we show that among all the interpolating solutions, the one with minimal norm also minimizes a bound on stability. Since the excess risk bounds of the minimum norm interpolant depend on the pseudoinverse of the kernel matrix, we establish here a clear link between numerical and statistical stability. This also holds for solutions computed by gradient descent, since gradient descent converges to minimum norm solutions in the case of "linear" kernel methods. Our approach is simple and combines basic stability results with matrix inequalities.
References
- Baksalary et al. (2003) Jerzy K Baksalary, Oskar Maria Baksalary, and Götz Trenkler. A revisitation of formulae for the moore–penrose inverse of modified matrices. Linear Algebra and Its Applications, 372:207–224, 2003.
- Bartlett et al. (2019) Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. CoRR, abs/1906.11300, 2019. URL http://arxiv.org/abs/1906.11300.
- Belkin et al. (2019) Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1903070116. URL https://www.pnas.org/content/116/32/15849.
- Boucheron et al. (2005) Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: probability and statistics, 9:323–375, 2005.
- Bousquet & Elisseeff (2001) O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2001.
- Bühlmann & Van De Geer (2011) Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
- El Karoui (2010) Noureddine El Karoui. The spectrum of kernel random matrices. arXiv e-prints, art. arXiv:1001.0492, Jan 2010.
- Hastie et al. (2019) Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. arXiv e-prints, art. arXiv:1903.08560, Mar 2019.
- Kutin & Niyogi (2002) S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. Technical report TR-2002-03, University of Chicago, 2002.
- Liang et al. (2019) Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the Risk of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels. arXiv e-prints, art. arXiv:1908.10292, Aug 2019.
- Liang et al. (2020) Tengyuan Liang, Alexander Rakhlin, et al. Just interpolate: Kernel “ridgeless” regression can generalize. Annals of Statistics, 48(3):1329–1347, 2020.
- Marchenko & Pastur (1967) V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mat. Sb. (N.S.), 72(114):4:457–483, 1967.
- Mei & Montanari (2019) Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv e-prints, art. arXiv:1908.05355, Aug 2019.
- Meyer (1973) Carl Meyer. Generalized inversion of modified matrices. SIAM J. Applied Math, 24:315–323, 1973.
- Micchelli (1986) C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11–22, 1986.
- Mukherjee et al. (2006) Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1):161–193, 2006. ISSN 1572-9044. doi: 10.1007/s10444-004-7634-z. URL http://dx.doi.org/10.1007/s10444-004-7634-z.
- Poggio et al. (2004) T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428:419–422, March 2004.
- Poggio et al. (2019) T. Poggio, G. Kur, and A. Banburski. Double descent in the condition number. Technical report, MIT Center for Brains Minds and Machines, 2019.
- Poggio (2020) Tomaso Poggio. Stable foundations for learning. Center for Brains, Minds and Machines (CBMM) Memo No. 103, 2020.
- Rakhlin & Zhai (2018) Alexander Rakhlin and Xiyu Zhai. Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon. arXiv e-prints, art. arXiv:1812.11167, Dec 2018.
- Rosasco & Villa (2015) Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 1630–1638. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/6015-learning-with-incremental-iterative-regularization.pdf.
- Shalev-Shwartz & Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014. ISBN 1107057132.
- Shalev-Shwartz et al. (2010) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635–2670, December 2010. ISSN 1532-4435.
- Steinwart & Christmann (2008) Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
- Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx.
Appendix A Excess Risk, Generalization, and Stability
We use the same notation as introduced in Section 2 for the various quantities considered in this section. That is, in the supervised learning setup, $\ell(f, z)$ is the loss incurred by hypothesis $f$ on the sample $z$, and $I[f]$ is the expected error of hypothesis $f$. Since we are interested in different forms of stability, we will consider learning problems over the original training set $S$, the leave one out training set $S_{\setminus i}$ (obtained by removing the $i$-th point from $S$), and the replace one training set $S^i$ (obtained by replacing the $i$-th point of $S$ with an independent sample $z_i'$ drawn from the same distribution).
A.1 Replace one and leave one out algorithmic stability
Similar to the definition of expected $\mathrm{CV}_{loo}$ stability in equation (3) of the main paper, we say an algorithm is cross validation replace one stable in expectation, denoted as $\mathrm{CV}_{ro}$ stability, if there exists $\beta_{ro} > 0$ such that for all $i = 1, \dots, n$, $\mathbb{E}_S\big[\ell(f_{S^i}, z_i) - \ell(f_S, z_i)\big] \le \beta_{ro}$.
We can strengthen the above stability definition by introducing the notion of replace one algorithmic stability in expectation (Bousquet & Elisseeff, 2001): there exists $\beta > 0$ such that for all $i = 1, \dots, n$,
We make two observations:
First, if the loss is Lipschitz, that is, if there exists $L > 0$ such that for all $f, f' \in \mathcal{H}$ and $z \in \mathcal{Z}$, $|\ell(f, z) - \ell(f', z)| \le L\,\|f - f'\|_{\mathcal{H}}$,
then replace one algorithmic stability implies $\mathrm{CV}_{ro}$ stability. Moreover, the same result holds if the loss is only locally Lipschitz and the relevant quantities are bounded almost surely. In this latter case the Lipschitz constant will depend on the bound. Later, we illustrate this situation for the square loss.
Second, we have for all $i = 1, \dots, n$ and all $z \in \mathcal{Z}$,
$\ell(f_{S^i}, z) - \ell(f_S, z) = \big[\ell(f_{S^i}, z) - \ell(f_{S_{\setminus i}}, z)\big] + \big[\ell(f_{S_{\setminus i}}, z) - \ell(f_S, z)\big].$
This observation motivates the notion of leave one out algorithmic stability in expectation (Bousquet & Elisseeff, 2001).
Clearly, leave one out algorithmic stability implies replace one algorithmic stability (with twice the constant), and it also implies $\mathrm{CV}_{loo}$ stability.
A.2 Excess Risk and $\mathrm{CV}_{loo}$, $\mathrm{CV}_{ro}$ Stability
We recall the statement of Lemma 5 in section 3, which bounds the excess risk using the $\mathrm{CV}_{loo}$ stability of a solution.
Lemma 12 (Excess Risk & $\mathrm{CV}_{loo}$ Stability)
For all $i = 1, \dots, n$,
$\mathbb{E}_S\big[I[f_{S_{\setminus i}}]\big] - \inf_{f \in \mathcal{H}} I[f] \;\le\; \mathbb{E}_S\big[\ell(f_{S_{\setminus i}}, z_i) - \ell(f_S, z_i)\big] \qquad (17)$
In this section, two properties of ERM are useful, namely symmetry and a form of unbiasedness.
Symmetry.
A key property of ERM is that it is symmetric with respect to the data set , meaning that it does not depend on the order of the data in .
A second property relates the expected ERM with the minimum of expected risk.
ERM Bias.
The following inequality holds.
$\mathbb{E}_S\big[I_S[f_S]\big] \le \inf_{f \in \mathcal{H}} I[f] \qquad (18)$
To see this, note that
$I_S[f_S] \le I_S[f]$
for all $f \in \mathcal{H}$ by definition of ERM, so that taking the expectation of both sides
$\mathbb{E}_S\big[I_S[f_S]\big] \le \mathbb{E}_S\big[I_S[f]\big] = I[f]$
for all $f \in \mathcal{H}$. This implies
$\mathbb{E}_S\big[I_S[f_S]\big] \le \inf_{f \in \mathcal{H}} I[f],$
and hence (18) holds.
Remark 13
Note that the same argument gives more generally that
(19)
Given the above premise, the proof of Lemma 5 is simple.
Proof [of Lemma 5] Adding and subtracting $\mathbb{E}_S\big[I_S[f_S]\big]$ from the expected excess risk we have that
$\mathbb{E}_S\big[I[f_{S_{\setminus i}}]\big] - \inf_{f \in \mathcal{H}} I[f] = \mathbb{E}_S\big[I[f_{S_{\setminus i}}] - I_S[f_S]\big] + \mathbb{E}_S\big[I_S[f_S]\big] - \inf_{f \in \mathcal{H}} I[f], \qquad (20)$
and since $\mathbb{E}_S\big[I_S[f_S]\big] - \inf_{f \in \mathcal{H}} I[f]$ is less than or equal to zero, see (19), then
$\mathbb{E}_S\big[I[f_{S_{\setminus i}}]\big] - \inf_{f \in \mathcal{H}} I[f] \le \mathbb{E}_S\big[I[f_{S_{\setminus i}}] - I_S[f_S]\big]. \qquad (21)$
Moreover, for all $i = 1, \dots, n$,
$\mathbb{E}_S\big[I[f_{S_{\setminus i}}]\big] = \mathbb{E}_S\big[\ell(f_{S_{\setminus i}}, z_i)\big]$ (since $z_i$ is independent of $S_{\setminus i}$) and $\mathbb{E}_S\big[I_S[f_S]\big] = \mathbb{E}_S\big[\ell(f_S, z_i)\big]$ (by the symmetry of ERM).
Plugging these last two expressions in (21) and in (20) leads to (4).
We can prove a similar result relating the excess risk to replace one stability.
Lemma 14 (Excess Risk & $\mathrm{CV}_{ro}$ Stability)
Given the above definitions, the following inequality holds for all $i = 1, \dots, n$,
(22)
Proof The first inequality follows by adding and subtracting $\mathbb{E}_S\big[I_S[f_S]\big]$ from the expected risk
and recalling (19). The main step in the proof is showing that for all $i = 1, \dots, n$,
(23)
to be compared with the corresponding trivial equality. To prove Equation (23), we have for all $i = 1, \dots, n$,
where we used the fact that, by the symmetry of the algorithm, the expectation is the same for all $i = 1, \dots, n$. The proof is concluded by combining the above observations.
A.3 Discussion on Stability and Generalization
Below we discuss some more aspects of stability and its connection to other quantities in statistical learning theory.
Remark 15 ($\mathrm{CV}_{loo}$ stability in expectation and in probability)
In Mukherjee et al. (2006), $\mathrm{CV}_{loo}$ stability is defined in probability, that is, there exist $\beta, \delta > 0$ such that
Note that the absolute value is not needed for ERM since almost positivity holds (Mukherjee et al., 2006), that is, $\ell(f_{S_{\setminus i}}, z_i) - \ell(f_S, z_i) \ge 0$. Then $\mathrm{CV}_{loo}$ stability in probability and in expectation are clearly related, and indeed equivalent for bounded loss functions. $\mathrm{CV}_{loo}$ stability in expectation (3) is what we study in the following sections.
Remark 16 (Connection to uniform stability and other notions of stability)
Uniform stability, introduced by Bousquet & Elisseeff (2001), corresponds in our notation to the assumption that there exists $\beta_u > 0$ such that for all training sets $S$, all $i = 1, \dots, n$, and all $z \in \mathcal{Z}$, $|\ell(f_{S_{\setminus i}}, z) - \ell(f_S, z)| \le \beta_u$. Clearly this is a strong notion implying most other definitions of stability. We note that there are a number of different notions of stability. We refer the interested reader to Kutin & Niyogi (2002) and Mukherjee et al. (2006).
Remark 17 ($\mathrm{CV}_{loo}$ Stability & Learnability)
A natural question is to which extent suitable notions of stability are not only sufficient but also necessary for controlling the excess risk of ERM. Classically, the latter is characterized in terms of a uniform version of the law of large numbers, which itself can be characterized in terms of suitable complexity measures of the hypothesis class. Uniform stability is too strong to characterize consistency, while $\mathrm{CV}_{loo}$ stability turns out to provide a suitably weak definition, as shown in Mukherjee et al. (2006); see also Kutin & Niyogi (2002). Indeed, a main result in Mukherjee et al. (2006) shows that $\mathrm{CV}_{loo}$ stability is equivalent to consistency of ERM:
Theorem 18
Mukherjee et al. (2006) For ERM and bounded loss functions, $\mathrm{CV}_{loo}$ stability in probability, with the stability parameters converging to zero as $n \to \infty$, is equivalent to consistency and generalization of ERM.
Remark 19 ($\mathrm{CV}_{loo}$ stability & in-sample/out-of-sample error)
Let $z$ be a data point drawn according to the same distribution, independently of $S_{\setminus i}$, and let $f_{S_{\setminus i} \cup \{z\}}$ be the corresponding ERM solution; then the right hand side of (4) can be equivalently written as
$\mathbb{E}\big[\ell(f_{S_{\setminus i}}, z) - \ell(f_{S_{\setminus i} \cup \{z\}}, z)\big].$
Thus $\mathrm{CV}_{loo}$ stability measures how much the loss changes when we test on a point that is present in the training set versus absent from it. In this view, it can be seen as an average measure of the difference between in-sample and out-of-sample error.
Remark 20 ($\mathrm{CV}_{loo}$ stability and generalization)
A common error measure is the (expected) generalization gap, $\mathbb{E}_S\big[I[f_S] - I_S[f_S]\big]$. For non-ERM algorithms, $\mathrm{CV}_{loo}$ stability by itself is not sufficient to control this term, and further conditions are needed (Mukherjee et al., 2006), since the gap can be split into two terms by adding and subtracting the average leave one out loss $\frac{1}{n}\sum_{i=1}^n \ell(f_{S_{\setminus i}}, z_i)$.
The second term equals $\frac{1}{n}\sum_{i=1}^n \mathbb{E}_S\big[\ell(f_{S_{\setminus i}}, z_i) - \ell(f_S, z_i)\big]$
and hence is controlled by CV stability. The first term is called the expected leave one out error in Mukherjee et al. (2006) and is controlled for ERM as $n \to \infty$, see Theorem 18 above.
Appendix B Stability of Linear Regression
We have a dataset $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ and we want to find a mapping $f(\mathbf{x}) = \mathbf{w}^\top\mathbf{x}$ that minimizes the empirical least squares risk. All interpolating solutions are of the form $\mathbf{w} = \mathbf{X}^\dagger\mathbf{y} + (\mathbf{I} - \mathbf{X}^\dagger\mathbf{X})\mathbf{v}$. Similarly, all interpolating solutions on the leave one out dataset can be written as $\mathbf{w}_{\setminus i} = \mathbf{X}_{\setminus i}^\dagger\mathbf{y}_{\setminus i} + (\mathbf{I} - \mathbf{X}_{\setminus i}^\dagger\mathbf{X}_{\setminus i})\mathbf{u}$. Here $\mathbf{X}, \mathbf{X}_{\setminus i}$ are the data matrices for the original and leave one out datasets respectively. We note that when $\mathbf{v} = \mathbf{0}$ and $\mathbf{u} = \mathbf{0}$, we obtain the minimum norm interpolating solutions on the datasets $S$ and $S_{\setminus i}$.
In this section we want to estimate the stability of the minimum norm solution to the ERM problem in the linear regression case. This is the case outlined in Remark 9 of the main paper. In order to prove Remark 9, we only need to combine Lemma 10 with the linear regression analogue of Lemma 11. We state and prove that result in this section. This result predicts a double descent curve for the norm of the pseudoinverse as found in practice, see Figure 3.
Lemma 21
Let $\mathbf{w}_S, \mathbf{w}_{S_{\setminus i}}$ be any interpolating least squares solutions on the full and leave one out datasets $S, S_{\setminus i}$. Then the bound on $\|\mathbf{w}_S - \mathbf{w}_{S_{\setminus i}}\|$ derived in the proof is minimized when $\mathbf{v} = \mathbf{u} = \mathbf{0}$, which corresponds to the minimum norm interpolating solutions $\mathbf{w}_S^\dagger, \mathbf{w}_{S_{\setminus i}}^\dagger$.
Also,
(24)
As mentioned before in section 2.1 of the main paper, linear regression can be viewed as a case of the kernel regression problem where $\mathcal{H} = \mathbb{R}^d$ and the feature map is the identity map. The inner product and norms considered in this case are the usual Euclidean inner product and $\ell_2$-norm. The notation $\|\cdot\|$ denotes the Euclidean norm for vectors both in $\mathbb{R}^d$ and in $\mathbb{R}^n$; the usage should be clear from the context. Also, $\|\mathbf{A}\|_{\mathrm{op}}$ is the left operator norm of a matrix $\mathbf{A}$, that is, $\|\mathbf{A}\|_{\mathrm{op}} = \sup_{\|\mathbf{v}\| = 1}\|\mathbf{v}^\top\mathbf{A}\|$.
We have $n$ samples in the training set for a linear regression problem, with $n < d$. We collect all the samples into a data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, whose $i$-th row is $\mathbf{x}_i^\top$, and a label vector $\mathbf{y} \in \mathbb{R}^n$. Then any interpolating ERM solution $\mathbf{w}$ satisfies the linear equation
$\mathbf{X}\mathbf{w} = \mathbf{y} \qquad (25)$
Any interpolating solution can be written as:
$\mathbf{w} = \mathbf{X}^\dagger\mathbf{y} + (\mathbf{I} - \mathbf{X}^\dagger\mathbf{X})\mathbf{v}, \qquad \mathbf{v} \in \mathbb{R}^d \qquad (26)$
[Figure 3: Double descent behavior of the norm of the pseudoinverse of the data matrix.]
If we consider the leave one out training set $S_{\setminus i}$, we can find the minimum norm ERM solution for the corresponding data matrix $\mathbf{X}_{\setminus i}$ and label vector $\mathbf{y}_{\setminus i}$ as
$\mathbf{w}_{S_{\setminus i}}^\dagger = \mathbf{X}_{\setminus i}^\dagger\,\mathbf{y}_{\setminus i} \qquad (27)$
We can write $\mathbf{X}_{\setminus i}$ as:
$\mathbf{X}_{\setminus i} = \mathbf{X} + \mathbf{e}_i\,\mathbf{b}^\top \qquad (28)$
where $\mathbf{b}$ is a column vector representing the additive change to the $i$-th row, i.e., $\mathbf{b} = -\mathbf{x}_i$, and $\mathbf{e}_i$ is the $i$-th element of the canonical basis in $\mathbb{R}^n$ (all the coefficients are zero but the $i$-th, which is one). Thus $\mathbf{e}_i\,\mathbf{b}^\top$ is a matrix composed of all zeros apart from the $i$-th row, which is equal to $-\mathbf{x}_i^\top$.
We also have $\mathbf{y}_{\setminus i} = \mathbf{y} - y_i\,\mathbf{e}_i$. Now, per Lemma 10, we are interested in bounding the quantity $\|\mathbf{w}_S - \mathbf{w}_{S_{\setminus i}}\|$. This simplifies to:
(29)
In the above equation we make use of the definitions of the interpolating solutions given above. We use an old formula (Meyer, 1973; Baksalary et al., 2003) to compute $\mathbf{X}_{\setminus i}^\dagger$ from $\mathbf{X}^\dagger$, following the development of pseudo-inverses of perturbed matrices in Meyer (1973). We see that $\mathbf{e}_i$ is a vector in the column space of $\mathbf{X}$ and $\mathbf{b}$ is in the range space of $\mathbf{X}^\top$ (provided $\mathbf{X}^\top$ has full column rank). This means we can use Theorem 6 in Meyer (1973) (equivalent to formula 2.1 in Baksalary et al. (2003)) to obtain the expression for $\mathbf{X}_{\setminus i}^\dagger$:
(30)
where the auxiliary vectors and scalars appearing in this formula are as defined in Meyer (1973).
(31)
The above set of inequalities follows from the fact that the operator norm of a rank 1 matrix $\mathbf{u}\mathbf{w}^\top$ is given by $\|\mathbf{u}\|\,\|\mathbf{w}\|$, and by noticing that $\|\mathbf{e}_i\| = 1$.
Also, from List 2 of Baksalary et al. (2003) we have that .
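A numerical check of the rank-one structure used in this appendix, under the convention that the rows of $\mathbf{X}$ are the training samples: removing the $i$-th sample is a rank-one modification of the data matrix, and the resulting perturbation of the pseudoinverse is the quantity entering the stability estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50
X = rng.standard_normal((n, d))                  # rows are the samples x_i
i = 3

b = -X[i]                                        # additive change to the i-th sample
e_i = np.zeros(n)
e_i[i] = 1.0
X_loo = X + np.outer(e_i, b)                     # rank-one update zeroing out the i-th sample

assert np.allclose(X_loo[i], 0.0)                # the i-th sample has been removed
print(np.linalg.matrix_rank(X_loo - X))          # the modification has rank 1
print(np.linalg.norm(np.linalg.pinv(X_loo) - np.linalg.pinv(X), 2))  # pseudoinverse perturbation
```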