
For interpolating kernel machines, minimizing the norm of the ERM solution minimizes stability

Akshay Rangamani [email protected] Center for Brains, Minds, and Machines, MIT McGovern Institute for Brain Research, MIT Lorenzo Rosasco [email protected] Center for Brains, Minds, and Machines, MIT Tomaso Poggio [email protected] Center for Brains, Minds, and Machines, MIT McGovern Institute for Brain Research, MIT
Abstract

We study the average $\mbox{CV}_{loo}$ stability of kernel ridge-less regression and derive corresponding risk bounds. We show that the interpolating solution with minimum norm minimizes a bound on $\mbox{CV}_{loo}$ stability, which in turn is controlled by the condition number of the empirical kernel matrix. The latter can be characterized in the asymptotic regime where both the dimension and cardinality of the data go to infinity. Under the assumption of random kernel matrices, the corresponding test error should be expected to follow a double descent curve.

1 Introduction

Statistical learning theory studies the learning properties of machine learning algorithms, and more fundamentally, the conditions under which learning from finite data is possible. In this context, classical learning theory focuses on the size of the hypothesis space in terms of different complexity measures, such as combinatorial dimensions, covering numbers and Rademacher/Gaussian complexities (Shalev-Shwartz & Ben-David, 2014; Boucheron et al., 2005). Another more recent approach is based on defining suitable notions of stability with respect to perturbation of the data (Bousquet & Elisseeff, 2001; Kutin & Niyogi, 2002). In this view, the continuity of the process that maps data to estimators is crucial, rather than the complexity of the hypothesis space. Different notions of stability can be considered, depending on the data perturbation and metric considered (Kutin & Niyogi, 2002). Interestingly, the stability and complexity approaches to characterizing the learnability of problems are not at odds with each other, and can be shown to be equivalent (Poggio et al., 2004; Shalev-Shwartz et al., 2010).

In modern machine learning, overparameterized models with more parameters than training samples have become common. The ability of these models to generalize is well explained by classical statistical learning theory as long as some form of regularization is used in the training process (Bühlmann & Van De Geer, 2011; Steinwart & Christmann, 2008). However, it was recently shown - first for deep networks (Zhang et al., 2017), and more recently for kernel methods (Belkin et al., 2019) - that learning is possible in the absence of regularization, i.e., when perfectly fitting/interpolating the data. Much recent work in statistical learning theory has tried to find theoretical grounding for this empirical finding. Since learning using models that interpolate is not exclusive to deep neural networks, we study generalization in the presence of interpolation in the case of kernel methods. We study both linear and kernel least squares problems in this paper.

Our Contributions:

  • We characterize the generalization properties of interpolating solutions for linear and kernel least squares problems using a stability approach. While the (uniform) stability properties of regularized kernel methods are well known (Bousquet & Elisseeff, 2001), we study interpolating solutions of the unregularized ("ridgeless") regression problems.

  • We obtain an upper bound on the stability of interpolating solutions, and show that this upper bound is minimized by the minimum norm interpolating solution. This also means that among all interpolating solutions, the minimum norm solution has the best test error. In particular, the same conclusion is also true for gradient descent, since it converges to the minimal norm solution in the setting we consider, see e.g. Rosasco & Villa (2015).

  • Our stability bounds show that the average stability of the minimum norm solution is controlled by the condition number of the empirical kernel matrix. It is well known that the numerical stability of the least squares solution is governed by the condition number of the associated kernel matrix (see Poggio et al. (2019)). Our results show that the condition number also controls stability (and hence, test error) in a statistical sense.

Organization:

In section 2, we introduce basic ideas in statistical learning and empirical risk minimization, as well as the notation used in the rest of the paper. In section 3, we briefly recall some definitions of stability. In section 4, we study the stability of interpolating solutions to kernel least squares and show that the minimum norm solutions minimize an upper bound on the stability. In section 5 we discuss our results in the context of recent work on high dimensional regression. We conclude in section 6.

2 Statistical Learning and Empirical Risk Minimization

We begin by recalling the basic ideas in statistical learning theory. In this setting, $X$ is the space of features, $Y$ is the space of targets or labels, and there is an unknown probability distribution $\mu$ on the product space $Z=X\times Y$. In the following, we consider $X=\mathbb{R}^{d}$ and $Y=\mathbb{R}$. The distribution $\mu$ is fixed but unknown, and we are given a training set $S$ consisting of $n$ samples (thus $|S|=n$) drawn i.i.d. from the probability distribution on $Z^{n}$, $S=(z_{i})_{i=1}^{n}=(\mathbf{x}_{i},y_{i})_{i=1}^{n}$. Intuitively, the goal of supervised learning is to use the training set $S$ to "learn" a function $f_{S}$ that, evaluated at a new value $\mathbf{x}_{new}$, should predict the associated value $y_{new}$, i.e. $y_{new}\approx f_{S}(\mathbf{x}_{new})$.

The loss is a function $V:\mathcal{F}\times Z\to[0,\infty)$, where $\mathcal{F}$ is the space of measurable functions from $X$ to $Y$, that measures how well a function performs on a data point. We define a hypothesis space $\mathcal{H}\subseteq\mathcal{F}$ where algorithms search for solutions. With the above notation, the expected risk of $f$ is defined as $I[f]=\mathbb{E}_{z}V(f,z)$, which is the expected loss on a new sample drawn according to the data distribution $\mu$. In this setting, statistical learning can be seen as the problem of finding an approximate minimizer of the expected risk given a training set $S$. A classical approach to derive an approximate solution is empirical risk minimization (ERM), where we minimize the empirical risk $I_{S}[f]=\frac{1}{n}\sum_{i=1}^{n}V(f,z_{i})$.

A natural error measure for our ERM solution $f_{S}$ is the expected excess risk $\mathbb{E}_{S}[I[f_{S}]-\min_{f\in\mathcal{H}}I[f]]$. Another common error measure is the expected generalization error/gap given by $\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]]$. These two error measures are closely related since the expected excess risk is easily bounded by the expected generalization error (see Lemma 5).

2.1 Kernel Least Squares and Minimal Norm Solution

The focus in this paper is on the kernel least squares problem. We assume the loss function $V$ is the square loss, that is, $V(f,z)=(y-f(\mathbf{x}))^{2}$. The hypothesis space is assumed to be a reproducing kernel Hilbert space, defined by a positive definite kernel $K:X\times X\to\mathbb{R}$ or an associated feature map $\Phi:X\to\mathcal{H}$, such that $K(\mathbf{x},\mathbf{x}^{\prime})=\langle\Phi(\mathbf{x}),\Phi(\mathbf{x}^{\prime})\rangle_{\mathcal{H}}$ for all $\mathbf{x},\mathbf{x}^{\prime}\in X$, where $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ is the inner product in $\mathcal{H}$. In this setting, functions are linearly parameterized, that is, there exists $w\in\mathcal{H}$ such that $f(\mathbf{x})=\langle w,\Phi(\mathbf{x})\rangle_{\mathcal{H}}$ for all $\mathbf{x}\in X$.

The ERM problem typically has multiple solutions, one of which is the minimal norm solution:

$$f_{S}^{\dagger}=\operatorname*{arg\,min}_{f\in\mathcal{M}}\left\lVert f\right\rVert_{\mathcal{H}},\qquad\mathcal{M}=\operatorname*{arg\,min}_{f\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^{n}(f(\mathbf{x}_{i})-y_{i})^{2}. \tag{1}$$

Here $\left\lVert\cdot\right\rVert_{\mathcal{H}}$ is the norm on $\mathcal{H}$ induced by the inner product. The minimal norm solution can be shown to be unique and satisfy a representer theorem, that is, for all $\mathbf{x}\in X$:

$$f_{S}^{\dagger}(\mathbf{x})=\sum_{i=1}^{n}K(\mathbf{x},\mathbf{x}_{i})c_{S,i},\qquad\mathbf{c}_{S}={\mathbf{K}}^{\dagger}{\mathbf{y}} \tag{2}$$

where $\mathbf{c}_{S}=(c_{S,1},\dots,c_{S,n})$, $\mathbf{y}=(y_{1},\ldots,y_{n})\in\mathbb{R}^{n}$, ${\mathbf{K}}$ is the $n$ by $n$ matrix with entries ${\mathbf{K}}_{ij}=K(\mathbf{x}_{i},\mathbf{x}_{j})$, $i,j=1,\dots,n$, and ${\mathbf{K}}^{\dagger}$ is the Moore-Penrose pseudoinverse of ${\mathbf{K}}$. If we assume $n\leq d$ and that we have $n$ linearly independent data features, that is, the rank of ${\mathbf{X}}$ is $n$, then it is possible to show that for many kernels one can replace ${\mathbf{K}}^{\dagger}$ by ${\mathbf{K}}^{-1}$ (see Remark 2). Note that invertibility is necessary and sufficient for interpolation. That is, if ${\mathbf{K}}$ is invertible, $f_{S}^{\dagger}(\mathbf{x}_{i})=y_{i}$ for all $i=1,\dots,n$, in which case the training error in (1) is zero.
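As an illustration, here is a minimal numpy sketch of equation (2): it builds the empirical kernel matrix for a Gaussian (RBF) kernel on synthetic data, computes the minimum norm coefficients $\mathbf{c}_{S}={\mathbf{K}}^{\dagger}\mathbf{y}$, and checks interpolation. The kernel choice, bandwidth, and synthetic data are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def rbf_kernel(A, B, sigma=5.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d = 20, 50                       # n <= d
X = rng.standard_normal((n, d))     # rows are the training points x_1, ..., x_n
y = rng.standard_normal(n)

K = rbf_kernel(X, X)                # empirical kernel matrix
c_S = np.linalg.pinv(K) @ y         # c_S = K^† y, the minimum norm coefficients

# Interpolation check: f_S^†(x_i) = sum_j c_{S,j} K(x_i, x_j) equals y_i
print(np.allclose(K @ c_S, y))      # True whenever K is invertible (Remark 2)

def f_min_norm(x_new):
    # Evaluate the minimum norm interpolant at a new point x_new of shape (d,).
    return (rbf_kernel(x_new[None, :], X) @ c_S)[0]
```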

Remark 1 (Pseudoinverse for underdetermined linear systems)

A simple yet relevant example is that of linear functions $f(\mathbf{x})=\mathbf{w}^{\top}\mathbf{x}$, which correspond to $\mathcal{H}=\mathbb{R}^{d}$ and $\Phi$ the identity map. If the rank of ${\mathbf{X}}\in\mathbb{R}^{d\times n}$ is $n$, then any interpolating solution $\mathbf{w}_{S}$ satisfies $\mathbf{w}_{S}^{\top}\mathbf{x}_{i}=y_{i}$ for all $i=1,\dots,n$, and the minimal norm solution, also called the Moore-Penrose solution, is given by $(\mathbf{w}_{S}^{\dagger})^{\top}={\mathbf{y}}^{\top}{\mathbf{X}}^{\dagger}$ where, since ${\mathbf{X}}$ has full column rank, the pseudoinverse takes the form ${\mathbf{X}}^{\dagger}=({\mathbf{X}}^{\top}{\mathbf{X}})^{-1}{\mathbf{X}}^{\top}$.
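A quick numerical sanity check of Remark 1, under the assumption of Gaussian synthetic data: the Moore-Penrose solution interpolates, and the pseudoinverse coincides with the full-column-rank formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 30, 10
X = rng.standard_normal((d, n))     # columns are the n data points; rank n a.s.
y = rng.standard_normal(n)

X_pinv = np.linalg.pinv(X)                                    # n x d
w_dagger = y @ X_pinv                                         # (w_S^†)^T = y^T X^†

print(np.allclose(X_pinv, np.linalg.inv(X.T @ X) @ X.T))      # (X^T X)^{-1} X^T
print(np.allclose(w_dagger @ X, y))                           # interpolation: w^T x_i = y_i
```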

Remark 2 (Invertibility of translation invariant kernels)

Translation invariant kernels are a family of kernel functions given by $K(\mathbf{x}_{1},\mathbf{x}_{2})=k(\mathbf{x}_{1}-\mathbf{x}_{2})$ where $k$ is an even function on $\mathbb{R}^{d}$. Translation invariant kernels are Mercer kernels (positive semidefinite) if the Fourier transform of $k(\cdot)$ is non-negative. For Radial Basis Function kernels ($K(\mathbf{x}_{1},\mathbf{x}_{2})=k(||\mathbf{x}_{1}-\mathbf{x}_{2}||)$) we have the additional property, due to Theorem 2.3 of Micchelli (1986), that for distinct points $\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}\in\mathbb{R}^{d}$ the kernel matrix ${\mathbf{K}}$ is non-singular and thus invertible.

The above discussion is directly related to regularization approaches.

Remark 3 (Stability and Tikhonov regularization)

Tikhonov regularization is used to prevent potential unstable behaviors. In the above setting, it corresponds to replacing Problem (1) by $\min_{f\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^{n}(f(\mathbf{x}_{i})-y_{i})^{2}+\lambda\left\lVert f\right\rVert^{2}_{\mathcal{H}}$, where the corresponding unique solution is given by $f_{S}^{\lambda}(\mathbf{x})=\sum_{i=1}^{n}K(\mathbf{x},\mathbf{x}_{i})c_{i}$, $\mathbf{c}=({\mathbf{K}}+\lambda\mathbf{I}_{n})^{-1}{\mathbf{y}}$. In contrast to ERM solutions, the above approach prevents interpolation. The properties of the corresponding estimator are well known. In this paper, we complement these results focusing on the case $\lambda\to 0$.
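The limit $\lambda\to 0$ can be seen numerically: below is a small sketch (with an assumed RBF kernel on synthetic data) showing that the Tikhonov coefficients $({\mathbf{K}}+\lambda\mathbf{I}_{n})^{-1}\mathbf{y}$ approach the minimal norm coefficients ${\mathbf{K}}^{\dagger}\mathbf{y}$ as $\lambda$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = rng.standard_normal((n, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)                      # RBF kernel matrix on distinct points
y = rng.standard_normal(n)

c_dagger = np.linalg.pinv(K) @ y           # minimal norm coefficients K^† y
for lam in [1.0, 1e-2, 1e-4, 1e-8]:
    c_lam = np.linalg.solve(K + lam * np.eye(n), y)   # (K + λI)^{-1} y
    print(f"lambda={lam:g}   ||c_lam - c_dagger|| = {np.linalg.norm(c_lam - c_dagger):.3e}")
```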

Finally, we end by recalling the connection between the minimal norm solution and gradient descent.

Remark 4 (Minimum norm and gradient descent)

In our setting, it is well known that both batch and stochastic gradient iterations converge exactly to the minimal norm solution when multiple solutions exist, see e.g. Rosasco & Villa (2015). Thus, a study of the properties of minimal norm solutions explains the properties of the solution to which gradient descent converges. In particular, when ERM has multiple interpolating solutions, gradient descent converges to a solution that minimizes a bound on stability, as we show next.

3 Error Bounds via Stability

In this section, we recall basic results relating the learning and stability properties of Empirical Risk Minimization (ERM). Throughout the paper, we assume that ERM achieves a minimum, albeit the extension to almost minimizers is possible (Mukherjee et al., 2006) and important for exponential-type loss functions (Poggio, 2020). We do not assume the expected risk to achieve a minimum. Since we will be considering leave-one-out stability in this section, we look at solutions to ERM over the complete training set $S=\{z_{1},z_{2},\dots,z_{n}\}$ and the leave one out training set $S_{i}=\{z_{1},z_{2},\dots,z_{i-1},z_{i+1},\dots,z_{n}\}$.

The excess risk of ERM can be easily related to its stability properties. Here, we follow the definition laid out in Mukherjee et al. (2006) and say that an algorithm is cross-validation leave-one-out ($\mbox{CV}_{loo}$) stable in expectation if there exists $\beta_{CV}>0$ such that for all $i=1,\dots,n$,

$$\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]\leq\beta_{CV}. \tag{3}$$

This definition is justified by the following result, which bounds the excess risk of a learning algorithm by its average $\mbox{CV}_{loo}$ stability.

Lemma 5 (Excess Risk & $\mbox{CV}_{loo}$ Stability)

For all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I[f_{S_{i}}]-\inf_{f\in\mathcal{H}}I[f]]\leq\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]. \tag{4}$$
Remark 6 (Connection to uniform stability and other notions of stability)

Uniform stability, introduced by Bousquet & Elisseeff (2001), corresponds in our notation to the assumption that there exists $\beta_{u}>0$ such that for all $i=1,\dots,n$, $\sup_{z\in Z}|V(f_{S_{i}},z)-V(f_{S},z)|\leq\beta_{u}$. Clearly this is a strong notion implying most other definitions of stability. We note that there are a number of different notions of stability. We refer the interested reader to Kutin & Niyogi (2002) and Mukherjee et al. (2006).

We present the proof of Lemma 5 in Appendix A.2 due to lack of space. In Appendix A, we also discuss other definitions of stability and their connections to concepts in statistical learning theory like generalization and learnability.

4 $\mbox{CV}_{loo}$ Stability of Kernel Least Squares

In this section we analyze the expected $\mbox{CV}_{loo}$ stability of interpolating solutions to the kernel least squares problem, and obtain an upper bound on their stability. We show that this upper bound on the expected $\mbox{CV}_{loo}$ stability is smallest for the minimal norm interpolating solution (1) when compared to other interpolating solutions to the kernel least squares problem.

We have a dataset $S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ and we want to find a mapping $f\in\mathcal{H}$ that minimizes the empirical least squares risk. Here $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) defined by a positive definite kernel $K:X\times X\to\mathbb{R}$. All interpolating solutions are of the form $\hat{f}_{S}(\cdot)=\sum_{j=1}^{n}\hat{c}_{S,j}K(\mathbf{x}_{j},\cdot)$, where $\hat{\mathbf{c}}_{S}={\mathbf{K}}^{\dagger}\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})\mathbf{v}$. Similarly, all interpolating solutions on the leave one out dataset $S_{i}$ can be written as $\hat{f}_{S_{i}}(\cdot)=\sum_{j=1,j\neq i}^{n}\hat{c}_{S_{i},j}K(\mathbf{x}_{j},\cdot)$, where $\hat{\mathbf{c}}_{S_{i}}={\mathbf{K}}_{S_{i}}^{\dagger}\mathbf{y}_{i}+(\mathbf{I}-{\mathbf{K}}_{S_{i}}^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i}$. Here ${\mathbf{K}},{\mathbf{K}}_{S_{i}}$ are the empirical kernel matrices on the original and leave one out datasets respectively. We note that when $\mathbf{v}=\mathbf{0}$ and $\mathbf{v}_{i}=\mathbf{0}$, we obtain the minimum norm interpolating solutions on the datasets $S$ and $S_{i}$.
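To make the quantity studied in Theorem 7 below concrete, the following sketch empirically estimates the $\mbox{CV}_{loo}$ gap $\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]$ for the minimum norm kernel interpolant by looping over left-out points. The RBF kernel, bandwidth, and synthetic data model are assumptions made only for illustration; for simplicity the leave-one-out solution is computed by dropping the $i$-th point rather than zeroing the $i$-th row and column of ${\mathbf{K}}$, which gives the same interpolant.

```python
import numpy as np

def rbf(A, B, sigma=5.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d = 40, 60
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)   # synthetic targets

K = rbf(X, X)
c_S = np.linalg.pinv(K) @ y                        # minimum norm interpolant on S

gaps = []
for i in range(n):
    keep = np.delete(np.arange(n), i)
    c_Si = np.linalg.pinv(rbf(X[keep], X[keep])) @ y[keep]   # interpolant on S_i
    f_Si_at_i = (rbf(X[i:i+1], X[keep]) @ c_Si)[0]           # f_{S_i}(x_i)
    f_S_at_i = (rbf(X[i:i+1], X) @ c_S)[0]                   # f_S(x_i), equals y_i
    gaps.append((y[i] - f_Si_at_i) ** 2 - (y[i] - f_S_at_i) ** 2)

print("average CV_loo gap:", np.mean(gaps))        # the quantity bounded in Theorem 7
```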

Theorem 7 (Main Theorem)

Consider the kernel least squares problem with a bounded kernel and bounded outputs $y$, that is, there exist $\kappa,M>0$ such that

$$K(\mathbf{x},\mathbf{x}^{\prime})\leq\kappa,\qquad|y|\leq M, \tag{5}$$

almost surely. Then for any interpolating solutions $\hat{f}_{S_{i}},\hat{f}_{S}$,

$$\mathbb{E}_{S}[V(\hat{f}_{S_{i}},z_{i})-V(\hat{f}_{S},z_{i})]\leq\beta_{CV}({\mathbf{K}}^{\dagger},\mathbf{y},\mathbf{v},\mathbf{v}_{i}) \tag{6}$$

This bound $\beta_{CV}$ is minimized when $\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}$, which corresponds to the minimum norm interpolating solutions $f_{S}^{\dagger},f_{S_{i}}^{\dagger}$. For the minimum norm solutions we have $\beta_{CV}=C_{1}\beta_{1}+C_{2}\beta_{2}$, where $\beta_{1}=\mathbb{E}_{S}\left[||{\mathbf{K}}^{\frac{1}{2}}||_{op}\,||{\mathbf{K}}^{\dagger}||_{op}\times\textrm{cond}({\mathbf{K}})\times||\mathbf{y}||\right]$ and $\beta_{2}=\mathbb{E}_{S}\left[||{\mathbf{K}}^{\frac{1}{2}}||_{op}^{2}\,||{\mathbf{K}}^{\dagger}||_{op}^{2}\times(\textrm{cond}({\mathbf{K}}))^{2}\times||\mathbf{y}||^{2}\right]$, and $C_{1},C_{2}$ are absolute constants that do not depend on either $d$ or $n$.

In the above theorem $||{\mathbf{K}}||_{op}$ refers to the operator norm of the kernel matrix ${\mathbf{K}}$, $||\mathbf{y}||$ refers to the standard $\ell_{2}$ norm for $\mathbf{y}\in\mathbb{R}^{n}$, and $\textrm{cond}({\mathbf{K}})$ is the condition number of the matrix ${\mathbf{K}}$.

We can combine the above result with Lemma 5 to obtain the following bound on excess risk for minimum norm interpolating solutions to the kernel least squares problem:

Corollary 8

The excess risk of the minimum norm interpolating kernel least squares solution can be bounded as:

$$\mathbb{E}_{S}\left[I[f_{S_{i}}^{\dagger}]-\inf_{f\in\mathcal{H}}I[f]\right]\leq C_{1}\beta_{1}+C_{2}\beta_{2}$$

where β1,β2\beta_{1},\beta_{2} are as defined previously.

Remark 9 (Underdetermined Linear Regression)

In the case of underdetermined linear regression, i.e., linear regression where the dimensionality is larger than the number of samples in the training set, we can prove a version of Theorem 7 with $\beta_{1}=\mathbb{E}_{S}\left[\left\lVert{\mathbf{X}}^{\dagger}\right\rVert_{op}\left\lVert\mathbf{y}\right\rVert\right]$ and $\beta_{2}=\mathbb{E}_{S}\left[\left\lVert{\mathbf{X}}^{\dagger}\right\rVert_{op}^{2}\left\lVert\mathbf{y}\right\rVert^{2}\right]$. Due to space constraints, we present the proof of the results in the linear regression case in Appendix B.

4.1 Key lemmas

In order to prove Theorem 7 we make use of the following lemmas to bound the $\mbox{CV}_{loo}$ stability using the norms and the difference of the solutions.

Lemma 10

Under assumption (5), for all $i=1,\dots,n$, it holds that

$$\mathbb{E}_{S}[V(\hat{f}_{S_{i}},z_{i})-V(\hat{f}_{S},z_{i})]\leq\mathbb{E}_{S}\left[\left(2M+\kappa\left(\left\lVert\hat{f}_{S}\right\rVert_{\mathcal{H}}+\left\lVert\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}\right)\right)\times\kappa\left\lVert\hat{f}_{S}-\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}\right]$$

Proof  We begin by recalling that the square loss is locally Lipschitz, that is, for all $y,a,a^{\prime}\in\mathbb{R}$,

$$|(y-a)^{2}-(y-a^{\prime})^{2}|\leq(2|y|+|a|+|a^{\prime}|)|a-a^{\prime}|.$$

If we apply this result to $f,f^{\prime}$ in a RKHS $\mathcal{H}$,

$$|(y-f(\mathbf{x}))^{2}-(y-f^{\prime}(\mathbf{x}))^{2}|\leq\kappa(2M+\kappa\left(\left\lVert f\right\rVert_{\mathcal{H}}+\left\lVert f^{\prime}\right\rVert_{\mathcal{H}}\right))\left\lVert f-f^{\prime}\right\rVert_{\mathcal{H}},$$

using the basic property of a RKHS that for all $f\in\mathcal{H}$

$$|f(\mathbf{x})|\leq\left\lVert f\right\rVert_{\infty}\leq\kappa\left\lVert f\right\rVert_{\mathcal{H}}. \tag{7}$$

In particular, we can plug $\hat{f}_{S_{i}}$ and $\hat{f}_{S}$ into the above inequality, and the almost positivity of ERM (Mukherjee et al., 2006) allows us to drop the absolute value on the left hand side. Finally, the desired result follows by taking the expectation over $S$.  


Now that we have bounded the $\mbox{CV}_{loo}$ stability using the norms and the difference of the solutions, we can find a bound on the difference between the solutions to the kernel least squares problem. This is our main stability estimate.

Lemma 11

Let $\hat{f}_{S},\hat{f}_{S_{i}}$ be any interpolating kernel least squares solutions on the full and leave one out datasets (as defined at the top of this section). Then $\left\lVert\hat{f}_{S}-\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}\leq B_{CV}({\mathbf{K}}^{\dagger},\mathbf{y},\mathbf{v},\mathbf{v}_{i})$, and $B_{CV}$ is minimized when $\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}$, which corresponds to the minimum norm interpolating solutions $f_{S}^{\dagger},f_{S_{i}}^{\dagger}$.

Also,

$$\left\lVert f_{S}^{\dagger}-f_{S_{i}}^{\dagger}\right\rVert_{\mathcal{H}}\leq\left\lVert{\mathbf{K}}^{\frac{1}{2}}\right\rVert_{op}\left\lVert{\mathbf{K}}^{\dagger}\right\rVert_{op}\times\textrm{cond}({\mathbf{K}})\times\left\lVert\mathbf{y}\right\rVert \tag{8}$$

Since the minimum norm interpolating solutions minimize both $\left\lVert\hat{f}_{S}\right\rVert_{\mathcal{H}}+\left\lVert\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}$ and $\left\lVert\hat{f}_{S}-\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}$ (from Lemmas 10 and 11), we can put them together to prove Theorem 7. In the following section we provide the proof of Lemma 11.

Simulation:

In order to illustrate that the minimum norm interpolating solution is the best performing interpolating solution, we ran a simple experiment on a linear regression problem. We synthetically generated data from a linear model $\mathbf{y}=\mathbf{w}^{\top}{\mathbf{X}}$, where ${\mathbf{X}}\in\mathbb{R}^{d\times n}$ was i.i.d. $\mathcal{N}(0,1)$. The dimension of the data was $d=1000$ and there were $n=200$ samples in the training dataset. A held out dataset of $50$ samples was used to compute the test mean squared error (MSE). Interpolating solutions were computed as $\hat{\mathbf{w}}^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}^{\top}(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})$ and the norm of $\mathbf{v}$ was varied to obtain the plot. The results are shown in Figure 1, where we can see that the training loss is 0 for all interpolants, but test MSE increases as $||\mathbf{v}||$ increases, with $(\mathbf{w}^{\dagger})^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}$ having the best performance. The figure reports results averaged over $100$ trials; a minimal code sketch of the experiment is given after the figure.

Figure 1: Plot of train and test mean squared error (MSE) vs distance between an interpolating solution $\hat{\mathbf{w}}$ and the minimum norm interpolant $\mathbf{w}^{\dagger}$ of a linear regression problem. Data was synthetically generated as $\mathbf{y}=\mathbf{w}^{\top}{\mathbf{X}}$, where ${\mathbf{X}}\in\mathbb{R}^{d\times n}$ with i.i.d. $\mathcal{N}(0,1)$ entries and $d=1000,n=200$. Other interpolating solutions were computed as $\hat{\mathbf{w}}^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}^{\top}(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})$ and the norm of $\mathbf{v}$ was varied to obtain the plot. Train MSE is 0 for all interpolants, but test MSE increases as $||\mathbf{v}||$ increases, with $\mathbf{w}^{\dagger}$ having the best performance. This plot represents results averaged over $100$ trials.
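Below is a minimal numpy sketch of this experiment (a single trial with a few values of $||\mathbf{v}||$ rather than a full averaged sweep; the random seed and the scales of $\mathbf{v}$ are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 1000, 200, 50
w = rng.standard_normal(d)

X = rng.standard_normal((d, n))          # training inputs, one sample per column
y = w @ X                                # y = w^T X (noiseless linear model)
X_te = rng.standard_normal((d, n_test))
y_te = w @ X_te

X_pinv = np.linalg.pinv(X)               # n x d
P = np.eye(d) - X @ X_pinv               # projector onto the complement of span(X)

for scale in [0.0, 0.1, 1.0, 10.0]:
    v = scale * rng.standard_normal(d)
    w_hat = y @ X_pinv + v @ P           # interpolant; scale = 0 gives w^†
    train_mse = np.mean((w_hat @ X - y) ** 2)        # ~0 for every interpolant
    test_mse = np.mean((w_hat @ X_te - y_te) ** 2)   # grows with ||v||
    print(f"||v|| = {np.linalg.norm(v):8.2f}   train MSE = {train_mse:.2e}   test MSE = {test_mse:.3f}")
```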

4.2 Proof of Lemma 11

We can write any interpolating solution to the kernel regression problem as $\hat{f}_{S}(\mathbf{x})=\sum_{i=1}^{n}\hat{c}_{S,i}K(\mathbf{x}_{i},\mathbf{x})$, where $\hat{\mathbf{c}}_{S}={\mathbf{K}}^{\dagger}\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})\mathbf{v}$, ${\mathbf{K}}\in\mathbb{R}^{n\times n}$ is the kernel matrix of $K$ on $S$ (i.e. ${\mathbf{K}}_{ij}=K(\mathbf{x}_{i},\mathbf{x}_{j})$), $\mathbf{v}$ is any vector in $\mathbb{R}^{n}$, and $\mathbf{y}\in\mathbb{R}^{n}$ is the vector $\mathbf{y}=[y_{1}\ldots y_{n}]^{\top}$.

Similarly, the coefficient vector for the corresponding interpolating solution to the problem over the leave one out dataset $S_{i}$ is $\hat{\mathbf{c}}_{S_{i}}=({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{y}_{i}+(\mathbf{I}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i}$, where $\mathbf{y}_{i}=[y_{1},\ldots,0,\ldots,y_{n}]^{\top}$ and ${\mathbf{K}}_{S_{i}}$ is the kernel matrix ${\mathbf{K}}$ with the $i^{\textrm{th}}$ row and column set to zero, which is the kernel matrix for the leave one out training set.

We define $\mathbf{a}=[-K(\mathbf{x}_{1},\mathbf{x}_{i}),\ldots,-K(\mathbf{x}_{n},\mathbf{x}_{i})]^{\top}\in\mathbb{R}^{n}$ and $\mathbf{b}\in\mathbb{R}^{n}$ as a one-hot column vector with all zeros apart from the $i^{\textrm{th}}$ component, which is $1$. Let $\mathbf{a}_{*}=\mathbf{a}+K(\mathbf{x}_{i},\mathbf{x}_{i})\mathbf{b}$. Then, we have:

$$\begin{split}{\mathbf{K}}_{*}&={\mathbf{K}}+\mathbf{a}\mathbf{b}^{\top}\\ {\mathbf{K}}_{S_{i}}&={\mathbf{K}}_{*}+\mathbf{b}\mathbf{a}_{*}^{\top}\end{split} \tag{9}$$
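The rank-2 update in equation (9) can be checked numerically; the sketch below (on a small RBF kernel matrix, an illustrative choice) confirms that the two rank-1 updates zero out exactly the $i$-th row and column of ${\mathbf{K}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
Z = rng.standard_normal((n, 3))
K = np.exp(-np.sum((Z[:, None] - Z[None, :]) ** 2, -1) / 2)   # RBF kernel matrix

i = 3                                    # index of the left-out point
a = -K[:, i].copy()                      # a = -[K(x_1, x_i), ..., K(x_n, x_i)]^T
b = np.zeros(n); b[i] = 1.0              # i-th canonical basis vector
a_star = a + K[i, i] * b

K_star = K + np.outer(a, b)              # zeroes the i-th column of K
K_Si = K_star + np.outer(b, a_star)      # zeroes the i-th row as well

K_check = K.copy()
K_check[i, :] = 0.0
K_check[:, i] = 0.0
print(np.allclose(K_Si, K_check))        # True: K_Si is K with row/column i zeroed
```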

That is, we can write ${\mathbf{K}}_{S_{i}}$ as a rank-2 update to ${\mathbf{K}}$. This can be verified by simple algebra, using the fact that $K$ is a symmetric kernel. Now we are interested in bounding $||\hat{f}_{S}-\hat{f}_{S_{i}}||_{\mathcal{H}}$. For a function $h(\cdot)=\sum_{i=1}^{m}p_{i}K(\mathbf{x}_{i},\cdot)\in\mathcal{H}$ we have $||h||_{\mathcal{H}}=\sqrt{\mathbf{p}^{\top}{\mathbf{K}}\mathbf{p}}=||{\mathbf{K}}^{\frac{1}{2}}\mathbf{p}||$. So we have:

$$\begin{split}||\hat{f}_{S}-\hat{f}_{S_{i}}||_{\mathcal{H}}&=||{\mathbf{K}}^{\frac{1}{2}}(\hat{\mathbf{c}}_{S}-\hat{\mathbf{c}}_{S_{i}})||\\ &=||{\mathbf{K}}^{\frac{1}{2}}({\mathbf{K}}^{\dagger}\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})\mathbf{v}-({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{y}_{i}-(\mathbf{I}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i})||\\ &=||{\mathbf{K}}^{\frac{1}{2}}({\mathbf{K}}^{\dagger}\mathbf{y}-({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{y}+y_{i}({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{b}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})(\mathbf{v}-\mathbf{v}_{i})+({\mathbf{K}}^{\dagger}{\mathbf{K}}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i})||\\ &=||{\mathbf{K}}^{\frac{1}{2}}[({\mathbf{K}}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger})\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})(\mathbf{v}-\mathbf{v}_{i})+({\mathbf{K}}^{\dagger}{\mathbf{K}}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i}]||\end{split} \tag{10}$$

Here we make use of the fact that $({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{b}=\mathbf{0}$. If ${\mathbf{K}}$ has full rank (as in Remark 2), we see that $\mathbf{b}$ lies in the column space of ${\mathbf{K}}$ and $\mathbf{a}$ lies in the column space of ${\mathbf{K}}^{\top}$. Furthermore, $\beta=1+\mathbf{b}^{\top}{\mathbf{K}}^{\dagger}\mathbf{a}=0$. Using Theorem 6 in Meyer (1973) (equivalent to formula 2.1 of Baksalary et al. (2003)) with $\mathbf{k}={\mathbf{K}}^{\dagger}\mathbf{a}$, $\mathbf{h}=\mathbf{b}^{\top}{\mathbf{K}}^{\dagger}$, we obtain:

$$\begin{split}({\mathbf{K}}_{*})^{\dagger}&={\mathbf{K}}^{\dagger}-\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{K}}^{\dagger}-{\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}+(\mathbf{k}^{\dagger}{\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger})\mathbf{k}\mathbf{h}\\ &={\mathbf{K}}^{\dagger}-\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{K}}^{\dagger}-{\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}-\mathbf{k}\mathbf{h}\\ &={\mathbf{K}}^{\dagger}-{\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}\end{split} \tag{11}$$

Above, we use the fact that the operator norm of a rank-1 matrix is given by $||\mathbf{u}\mathbf{v}^{\top}||_{op}=||\mathbf{u}||\times||\mathbf{v}||$ and that $\mathbf{k}=-\mathbf{b}$. Also, using the corresponding formula from List 2 of Baksalary et al. (2003), we have ${\mathbf{K}}_{*}^{\dagger}{\mathbf{K}}_{*}={\mathbf{K}}^{\dagger}{\mathbf{K}}-\mathbf{k}\mathbf{k}^{\dagger}$.

Next, we see that for ${\mathbf{K}}_{S_{i}}$, $\mathbf{a}_{*}$ lies in the column space of ${\mathbf{K}}_{*}^{\top}$, and $\mathbf{b}$ does not lie in the column space of ${\mathbf{K}}_{*}$. This means we can use Theorem 5 in Meyer (1973) (equivalent to formula 2.4 in Baksalary et al. (2003)) to obtain the expression for $({\mathbf{K}}_{S_{i}})^{\dagger}$:

$$({\mathbf{K}}_{S_{i}})^{\dagger}={\mathbf{K}}_{*}^{\dagger}-\nu^{-1}(\phi{\mathbf{K}}_{*}^{\dagger}\mathbf{e}\mathbf{e}^{\top}+\eta\mathbf{d}\mathbf{f}^{\top})+\nu^{-1}(\lambda{\mathbf{K}}_{*}^{\dagger}\mathbf{e}\mathbf{f}^{\top}-\lambda\mathbf{d}\mathbf{e}^{\top}) \tag{12}$$

Also, using the corresponding formula from List 2 of Baksalary et al. (2003), we have $({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}}={\mathbf{K}}_{*}^{\dagger}{\mathbf{K}}_{*}$, which implies that ${\mathbf{K}}^{\dagger}{\mathbf{K}}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}}=\mathbf{k}\mathbf{k}^{\dagger}$.

Now let us define all the terms in equation  (12). We omit some algebraic simplification to save some space.

$$\begin{split}\mathbf{d}&={\mathbf{K}}_{*}^{\dagger}\mathbf{b}={\mathbf{K}}^{\dagger}(\mathbf{I}-\mathbf{h}^{\dagger}\mathbf{h})\mathbf{b}\\ \mathbf{e}&=({\mathbf{K}}_{*}^{\dagger})^{\top}\mathbf{a}_{*}=(\mathbf{h}^{\dagger}\mathbf{h}-\mathbf{I})\mathbf{b}\\ \mathbf{f}&=(\mathbf{I}-{\mathbf{K}}_{*}{\mathbf{K}}_{*}^{\dagger})\mathbf{b}=\mathbf{h}^{\dagger}\mathbf{h}\mathbf{b}\\ \lambda&=1+\mathbf{a}_{*}^{\top}{\mathbf{K}}_{*}^{\dagger}\mathbf{b}=1+\mathbf{e}^{\top}\mathbf{b}=\mathbf{b}^{\top}\mathbf{h}^{\dagger}\mathbf{h}\mathbf{b}\\ \phi&=\mathbf{f}^{\top}\mathbf{f}=\mathbf{b}^{\top}\mathbf{h}^{\dagger}\mathbf{h}\mathbf{b}=\lambda\\ \eta&=\mathbf{e}^{\top}\mathbf{e}=1-\lambda\\ \nu&=\lambda^{2}+\eta\phi=\lambda\end{split} \tag{13}$$

This means we can simplify equation  (12) as

$$\begin{split}{\mathbf{K}}_{*}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger}&=\nu^{-1}(\phi{\mathbf{K}}_{*}^{\dagger}\mathbf{e}\mathbf{e}^{\top}+\eta\mathbf{d}\mathbf{f}^{\top})-\nu^{-1}(\lambda{\mathbf{K}}_{*}^{\dagger}\mathbf{e}\mathbf{f}^{\top}-\lambda\mathbf{d}\mathbf{e}^{\top})\\ &=\lambda^{-1}\left(\lambda{\mathbf{K}}_{*}^{\dagger}\mathbf{e}(\mathbf{e}-\mathbf{f})^{\top}+\eta\mathbf{d}\mathbf{f}^{\top}+\lambda\mathbf{d}\mathbf{e}^{\top}\right)\\ &={\mathbf{K}}_{*}^{\dagger}\mathbf{e}(\mathbf{e}-\mathbf{f})^{\top}+\mathbf{d}(\mathbf{e}-\mathbf{f})^{\top}+\lambda^{-1}\mathbf{d}\mathbf{f}^{\top}\end{split} \tag{14}$$

Putting together  (11),  (12) (and after some algebraic simplification) we get:

$$\begin{split}{\mathbf{K}}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger}&={\mathbf{K}}^{\dagger}-{\mathbf{K}}_{*}^{\dagger}+{\mathbf{K}}_{*}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger}\\ &={\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}+\lambda^{-1}{\mathbf{K}}^{\dagger}(\mathbf{I}-\mathbf{h}^{\dagger}\mathbf{h})\mathbf{b}\mathbf{b}^{\top}\mathbf{h}^{\dagger}\mathbf{h}\\ &=({\mathbf{K}}^{\dagger}_{ii})^{-1}\mathbf{h}^{\top}\mathbf{h}\end{split} \tag{15}$$

Plugging these calculations into equation (10) we get:

$$\begin{split}||\hat{f}_{S}-\hat{f}_{S_{i}}||_{\mathcal{H}}&=||{\mathbf{K}}^{\frac{1}{2}}[({\mathbf{K}}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger})\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})(\mathbf{v}-\mathbf{v}_{i})+({\mathbf{K}}^{\dagger}{\mathbf{K}}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i}]||\\ &\leq||{\mathbf{K}}^{\frac{1}{2}}||_{op}\left(||({\mathbf{K}}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger})\mathbf{y}||+||(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})(\mathbf{v}-\mathbf{v}_{i})||+||\mathbf{k}\mathbf{k}^{\dagger}\mathbf{v}_{i}||\right)\\ &\leq||{\mathbf{K}}^{\frac{1}{2}}||_{op}(B_{0}+2||\mathbf{v}-\mathbf{v}_{i}||+||\mathbf{v}_{i}||)\end{split} \tag{16}$$

We see that the right hand side is minimized when $\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}$. We have also computed $B_{0}=||{\mathbf{K}}^{\dagger}||_{op}\times\textrm{cond}({\mathbf{K}})\times||\mathbf{y}||$, which concludes the proof of Lemma 11.

5 Remark and Related Work

In the previous section we obtained bounds on the $\mbox{CV}_{loo}$ stability of interpolating solutions to the kernel least squares problem. Our kernel least squares results can be compared with stability bounds for regularized ERM (see Remark 3). Regularized ERM has a strong stability guarantee in terms of a uniform stability bound, which turns out to be inversely proportional to the regularization parameter $\lambda$ and the sample size $n$ (Bousquet & Elisseeff, 2001). However, this estimate becomes vacuous as $\lambda\to 0$. In this paper, we establish a bound on average stability, and show that this bound is minimized when the minimum norm ERM solution is chosen. We study average stability since one can expect worst case scenarios where the minimum norm is arbitrarily large (when $n\approx d$). Our main finding is in fact the relationship between minimizing the norm of the ERM solution and minimizing a bound on stability.

This leads to a second observation, namely, that we can consider the limit of our risk bounds as both the sample size (nn) and the dimensionality of the data (dd) go to infinity. This is a classical setting in statistics which allows us to use results from random matrix theory (Marchenko & Pastur, 1967). In particular, for linear kernels the behavior of the smallest eigenvalue of the kernel matrix (which appears in our bounds) can be characterized in this asymptotic limit. Here the dimension of the data coincides with the number of parameters in the model. Interestingly, analogous results hold for more general kernels (El Karoui, 2010) where the asymptotics are taken with respect to the number and dimensionality of the data. These results predict a double descent curve for the condition number as found in practice, see Figure 2.

Recently, there has been a surge of interest in studying linear and kernel least squares models, since classical results focus on situations where constraints or penalties that prevent interpolation are added to the empirical risk. For example, high dimensional linear regression is considered in Mei & Montanari (2019); Hastie et al. (2019); Bartlett et al. (2019), and "ridgeless" kernel least squares is studied in Liang et al. (2019); Rakhlin & Zhai (2018) and Liang et al. (2020). While these papers study upper and lower bounds on the risk of interpolating solutions to the linear and kernel least squares problem, ours are the first derived using stability arguments. While it might be possible to obtain tighter excess risk bounds through a careful analysis of the minimum norm interpolant, our simple approach helps us establish a link between stability in the statistical and in the numerical sense.

Finally, we can compare our results with observations made in Poggio et al. (2019) on the condition number of random kernel matrices. The condition number of the empirical kernel matrix is known to control the numerical stability of the solution to a kernel least squares problem. Our results show that the statistical stability is also controlled by the condition number of the kernel matrix, providing a natural link between numerical and statistical stability.

Figure 2: Typical double descent of the condition number (y axis) of a radial basis function kernel $K(x,x^{\prime})=\exp\left(-\frac{||x-x^{\prime}||^{2}}{2\sigma^{2}}\right)$ built from a random data matrix distributed as $\mathcal{N}(0,1)$: as in the linear case, the condition number is worse when $n=d$, better if $n>d$ (on the right of $n=d$) and also better if $n<d$ (on the left of $n=d$). The parameter $\sigma$ was chosen to be 5. From Poggio et al. (2019).
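A sketch of the kind of experiment behind Figure 2 is given below: it averages the condition number of an RBF kernel matrix over random draws of the data for several values of $n$ at fixed $d$. The values of $d$, $\sigma$, and the grid of $n$ are illustrative guesses; how pronounced the peak near $n=d$ appears depends on these choices and on the data normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 100, 5.0

def mean_rbf_condition_number(n, trials=10):
    vals = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))                   # rows ~ N(0, I_d)
        sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        K = np.exp(-sq / (2 * sigma**2))                  # RBF kernel matrix
        vals.append(np.linalg.cond(K))
    return np.mean(vals)

for n in [20, 50, 90, 100, 110, 150, 300]:
    print(f"n = {n:4d}   cond(K) ~ {mean_rbf_condition_number(n):.3e}")
```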

6 Conclusions

In summary, minimizing a bound on cross validation stability minimizes the expected error in both the classical and the modern regime of ERM. In the classical regime ($d<n$), $\mbox{CV}_{loo}$ stability implies generalization and consistency for $n\to\infty$. In the modern regime ($d>n$), as described in this paper, $\mbox{CV}_{loo}$ stability can account for the double descent curve in kernel interpolants (Belkin et al. (2019) showed empirical evidence for closely related interpolation methods) under appropriate distributional assumptions. The main contribution of this paper is characterizing stability of interpolating solutions, in particular deriving excess risk bounds via a stability argument. In the process, we show that among all the interpolating solutions, the one with minimal norm also minimizes a bound on stability. Since the excess risk bounds of the minimum norm interpolant depend on the pseudoinverse of the kernel matrix, we establish here a clear link between numerical and statistical stability. This also holds for solutions computed by gradient descent, since gradient descent converges to minimum norm solutions in the case of "linear" kernel methods. Our approach is simple and combines basic stability results with matrix inequalities.

References

  • Baksalary et al. (2003) Jerzy K Baksalary, Oskar Maria Baksalary, and Götz Trenkler. A revisitation of formulae for the moore–penrose inverse of modified matrices. Linear Algebra and Its Applications, 372:207–224, 2003.
  • Bartlett et al. (2019) Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. CoRR, abs/1906.11300, 2019. URL http://arxiv.org/abs/1906.11300.
  • Belkin et al. (2019) Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1903070116. URL https://www.pnas.org/content/116/32/15849.
  • Boucheron et al. (2005) Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: probability and statistics, 9:323–375, 2005.
  • Bousquet & Elisseeff (2001) O. Bousquet and A. Elisseeff. Stability and generalization. Journal Machine Learning Research, 2001.
  • Bühlmann & Van De Geer (2011) Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
  • El Karoui (2010) Noureddine El Karoui. The spectrum of kernel random matrices. arXiv e-prints, art. arXiv:1001.0492, Jan 2010.
  • Hastie et al. (2019) Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. arXiv e-prints, art. arXiv:1903.08560, Mar 2019.
  • Kutin & Niyogi (2002) S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. Technical report TR-2002-03, University of Chicago, 2002.
  • Liang et al. (2019) Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the Risk of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels. arXiv e-prints, art. arXiv:1908.10292, Aug 2019.
  • Liang et al. (2020) Tengyuan Liang, Alexander Rakhlin, et al. Just interpolate: Kernel “ridgeless” regression can generalize. Annals of Statistics, 48(3):1329–1347, 2020.
  • Marchenko & Pastur (1967) V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mat. Sb. (N.S.), 72(114):4:457–483, 1967.
  • Mei & Montanari (2019) Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv e-prints, art. arXiv:1908.05355, Aug 2019.
  • Meyer (1973) Carl Meyer. Generalized inversion of modified matrices. SIAM J. Applied Math, 24:315–323, 1973.
  • Micchelli (1986) C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11–22, 1986.
  • Mukherjee et al. (2006) Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1):161–193, 2006. ISSN 1572-9044. doi: 10.1007/s10444-004-7634-z. URL http://dx.doi.org/10.1007/s10444-004-7634-z.
  • Poggio et al. (2004) T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428:419–422, March 2004.
  • Poggio et al. (2019) T. Poggio, G. Kur, and A. Banburski. Double descent in the condition number. Technical report, MIT Center for Brains Minds and Machines, 2019.
  • Poggio (2020) Tomaso Poggio. Stable foundations for learning. Center for Brains, Minds and Machines (CBMM) Memo No. 103, 2020.
  • Rakhlin & Zhai (2018) Alexander Rakhlin and Xiyu Zhai. Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon. arXiv e-prints, art. arXiv:1812.11167, Dec 2018.
  • Rosasco & Villa (2015) Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 1630–1638. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/6015-learning-with-incremental-iterative-regularization.pdf.
  • Shalev-Shwartz & Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014. ISBN 1107057132.
  • Shalev-Shwartz et al. (2010) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635–2670, December 2010. ISSN 1532-4435.
  • Steinwart & Christmann (2008) Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
  • Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx.

Appendix A Excess Risk, Generalization, and Stability

We use the same notation as introduced in Section 2 for the various quantities considered in this section. That is, in the supervised learning setup $V(f,z)$ is the loss incurred by hypothesis $f$ on the sample $z$, and $I[f]=\mathbb{E}_{z}[V(f,z)]$ is the expected error of hypothesis $f$. Since we are interested in different forms of stability, we will consider learning problems over the original training set $S=\{z_{1},z_{2},\ldots,z_{n}\}$, the leave one out training set $S_{i}=\{z_{1},\ldots,z_{i-1},z_{i+1},\ldots,z_{n}\}$, and the replace one training set $(S_{i},z)=\{z_{1},\ldots,z_{i-1},z_{i+1},\ldots,z_{n},z\}$.

A.1 Replace one and leave one out algorithmic stability

Similar to the definition of expected $\mbox{CV}_{loo}$ stability in equation (3) of the main paper, we say an algorithm is cross validation replace one stable (in expectation), denoted as $\mbox{CV}_{ro}$, if there exists $\beta_{ro}>0$ such that

$$\mathbb{E}_{S,z}[V(f_{S},z)-V(f_{(S_{i},z)},z)]\leq\beta_{ro}.$$

We can strengthen the above stability definition by introducing the notion of replace one algorithmic stability (in expectation) (Bousquet & Elisseeff, 2001): there exists $\alpha_{ro}>0$ such that for all $i=1,\dots,n$,

$$\mathbb{E}_{S,z}[\left\lVert f_{S}-f_{(S_{i},z)}\right\rVert_{\infty}]\leq\alpha_{ro}.$$

We make two observations:
First, if the loss is Lipschitz, that is, if there exists $C_{V}>0$ such that for all $f,f^{\prime}\in\mathcal{H}$

$$\left\lVert V(f,z)-V(f^{\prime},z)\right\rVert\leq C_{V}\left\lVert f-f^{\prime}\right\rVert,$$

then replace one algorithmic stability implies $\mbox{CV}_{ro}$ stability with $\beta_{ro}=C_{V}\alpha_{ro}$. Moreover, the same result holds if the loss is locally Lipschitz and there exists $R>0$ such that $\left\lVert f_{S}\right\rVert_{\infty}\leq R$ almost surely. In this latter case the Lipschitz constant will depend on $R$. Later, we illustrate this situation for the square loss.

Second, we have for all $i=1,\dots,n$, $S$ and $z$,

$$\mathbb{E}_{S,z}[\left\lVert f_{S}-f_{(S_{i},z)}\right\rVert_{\infty}]\leq\mathbb{E}_{S,z}[\left\lVert f_{S}-f_{S_{i}}\right\rVert_{\infty}]+\mathbb{E}_{S,z}[\left\lVert f_{(S_{i},z)}-f_{S_{i}}\right\rVert_{\infty}].$$

This observation motivates the notion of leave one out algorithmic stability (in expectation) (Bousquet & Elisseeff, 2001):

$$\mathbb{E}_{S,z}[\left\lVert f_{S}-f_{S_{i}}\right\rVert_{\infty}]\leq\alpha_{loo}.$$

Clearly, leave one out algorithmic stability implies replace one algorithmic stability with $\alpha_{ro}=2\alpha_{loo}$, and it also implies $\mbox{CV}_{ro}$ stability with $\beta_{ro}=2C_{V}\alpha_{loo}$.

A.2 Excess Risk and $\mbox{CV}_{loo}$, $\mbox{CV}_{ro}$ Stability

We recall the statement of Lemma 5 in Section 3, which bounds the excess risk using the $\mbox{CV}_{loo}$ stability of a solution.

Lemma 12 (Excess Risk & $\mbox{CV}_{loo}$ Stability)

For all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I[f_{S_{i}}]-\inf_{f\in\mathcal{H}}I[f]]\leq\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]. \tag{17}$$

In this section, two properties of ERM are useful, namely symmetry, and a form of unbiasedness.

Symmetry.

A key property of ERM is that it is symmetric with respect to the data set $S$, meaning that it does not depend on the order of the data in $S$.

A second property relates the expected ERM with the minimum of expected risk.

ERM Bias.

The following inequality holds.

$$\mathbb{E}_{S}[I_{S}[f_{S}]]-\min_{f\in\mathcal{H}}I[f]\leq 0. \tag{18}$$

To see this, note that

$$I_{S}[f_{S}]\leq I_{S}[f]$$

for all $f\in\mathcal{H}$ by definition of ERM, so that taking the expectation of both sides

$$\mathbb{E}_{S}[I_{S}[f_{S}]]\leq\mathbb{E}_{S}[I_{S}[f]]=I[f]$$

for all $f\in\mathcal{H}$. This implies

$$\mathbb{E}_{S}[I_{S}[f_{S}]]\leq\min_{f\in\mathcal{H}}I[f]$$

and hence (18) holds.

Remark 13

Note that the same argument gives more generally that

$$\mathbb{E}_{S}[\inf_{f\in\mathcal{H}}I_{S}[f]]-\inf_{f\in\mathcal{H}}I[f]\leq 0. \tag{19}$$

Given the above premise, the proof of Lemma 5 is simple.

Proof [of Lemma 5] Adding and subtracting $\mathbb{E}_{S}[I_{S}[f_{S}]]$ from the expected excess risk we have that

$$\mathbb{E}_{S}[I[f_{S_{i}}]-\min_{f\in\mathcal{H}}I[f]]=\mathbb{E}_{S}[I[f_{S_{i}}]-I_{S}[f_{S}]+I_{S}[f_{S}]-\min_{f\in\mathcal{H}}I[f]], \tag{20}$$

and since $\mathbb{E}_{S}[I_{S}[f_{S}]]-\min_{f\in\mathcal{H}}I[f]$ is less than or equal to zero, see (19), then

$$\mathbb{E}_{S}[I[f_{S_{i}}]-\min_{f\in\mathcal{H}}I[f]]\leq\mathbb{E}_{S}[I[f_{S_{i}}]-I_{S}[f_{S}]]. \tag{21}$$

Moreover, for all $i=1,\dots,n$

$$\mathbb{E}_{S}[I[f_{S_{i}}]]=\mathbb{E}_{S}[\mathbb{E}_{z_{i}}V(f_{S_{i}},z_{i})]=\mathbb{E}_{S}[V(f_{S_{i}},z_{i})]$$

and

$$\mathbb{E}_{S}[I_{S}[f_{S}]]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{S}[V(f_{S},z_{i})]=\mathbb{E}_{S}[V(f_{S},z_{i})].$$

Plugging these last two expressions in (21) and in (20) leads to (4).  


We can prove a similar result relating excess risk with $\mbox{CV}_{ro}$ stability.

Lemma 14 (Excess Risk & $\mbox{CV}_{ro}$ Stability)

Given the above definitions, the following inequality holds for all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I[f_{S}]-\inf_{f\in\mathcal{H}}I[f]]\leq\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]]=\mathbb{E}_{S,z}[V(f_{S},z)-V(f_{(S_{i},z)},z)]. \tag{22}$$

Proof  The first inequality is clear: adding and subtracting $I_{S}[f_{S}]$ from the expected risk $I[f_{S}]$ we have that

$$\mathbb{E}_{S}[I[f_{S}]-\min_{f\in\mathcal{H}}I[f]]=\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]+I_{S}[f_{S}]-\min_{f\in\mathcal{H}}I[f]],$$

and recalling (19). The main step in the proof is showing that for all $i=1,\dots,n$,

$$\mathbb{E}[I_{S}[f_{S}]]=\mathbb{E}[V(f_{(S_{i},z)},z)] \tag{23}$$

to be compared with the trivial equality $\mathbb{E}[I_{S}[f_{S}]]=\mathbb{E}[V(f_{S},z_{i})]$. To prove Equation (23), we have for all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I_{S}[f_{S}]]=\mathbb{E}_{S,z}\left[\frac{1}{n}\sum_{i=1}^{n}V(f_{S},z_{i})\right]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{S,z}[V(f_{(S_{i},z)},z)]=\mathbb{E}_{S,z}[V(f_{(S_{i},z)},z)]$$

where we used the fact that by the symmetry of the algorithm $\mathbb{E}_{S,z}[V(f_{(S_{i},z)},z)]$ is the same for all $i=1,\dots,n$. The proof is concluded noting that $\mathbb{E}_{S}[I[f_{S}]]=\mathbb{E}_{S,z}[V(f_{S},z)]$.  


A.3 Discussion on Stability and Generalization

Below we discuss some more aspects of stability and its connection to other quantities in statistical learning theory.

Remark 15 ($\mbox{CV}_{loo}$ stability in expectation and in probability)

In Mukherjee et al. (2006), $\mbox{CV}_{loo}$ stability is defined in probability, that is, there exists $\beta^{P}_{CV}>0$, $0<\delta^{P}_{CV}\leq 1$ such that

$$\mathbb{P}_{S}\{|V(f_{S_{i}},z_{i})-V(f_{S},z_{i})|\geq\beta^{P}_{CV}\}\leq\delta^{P}_{CV}.$$

Note that the absolute value is not needed for ERM since almost positivity holds (Mukherjee et al., 2006), that is, $V(f_{S_{i}},z_{i})-V(f_{S},z_{i})>0$. Then $\mbox{CV}_{loo}$ stability in probability and in expectation are clearly related, and indeed equivalent for bounded loss functions. $\mbox{CV}_{loo}$ stability in expectation (3) is what we study in the following sections.

Remark 16 (Connection to uniform stability and other notions of stability)

Uniform stability, introduced by Bousquet & Elisseeff (2001), corresponds in our notation to the assumption that there exists $\beta_{u}>0$ such that for all $i=1,\dots,n$, $\sup_{z\in Z}|V(f_{S_{i}},z)-V(f_{S},z)|\leq\beta_{u}$. Clearly this is a strong notion implying most other definitions of stability. We note that there are a number of different notions of stability. We refer the interested reader to Kutin & Niyogi (2002) and Mukherjee et al. (2006).

Remark 17 ($\mbox{CV}_{loo}$ Stability & Learnability)

A natural question is to which extent suitable notions of stability are not only sufficient but also necessary for controlling the excess risk of ERM. Classically, the latter is characterized in terms of a uniform version of the law of large numbers, which itself can be characterized in terms of suitable complexity measures of the hypothesis class. Uniform stability is too strong to characterize consistency, while $\mbox{CV}_{loo}$ stability turns out to provide a suitably weak definition, as shown in Mukherjee et al. (2006), see also Kutin & Niyogi (2002). Indeed, a main result in Mukherjee et al. (2006) shows that $\mbox{CV}_{loo}$ stability is equivalent to consistency of ERM:

Theorem 18

Mukherjee et al. (2006) For ERM and bounded loss functions, $\mbox{CV}_{loo}$ stability in probability with $\beta^{P}_{CV}$ converging to zero for $n\to\infty$ is equivalent to consistency and generalization of ERM.

Remark 19 ($\mbox{CV}_{loo}$ stability & in-sample/out-of-sample error)

Let $(S,z)=\{z_{1},\dots,z_{n},z\}$ ($z$ is a data point drawn according to the same distribution) and let $f_{(S,z)}$ be the corresponding ERM solution; then (4) can be equivalently written as

$$\mathbb{E}_{S}[I[f_{S}]-\inf_{f\in\mathcal{F}}I[f]]\leq\mathbb{E}_{S,z}[V(f_{S},z)-V(f_{(S,z)},z)].$$

Thus $\mbox{CV}_{loo}$ stability measures how much the loss at a point changes when that point is included in the training set versus when it is left out. In this view, it can be seen as an average measure of the difference between the in-sample and out-of-sample error.

Remark 20 ($\mbox{CV}_{loo}$ stability and generalization)

A common error measure is the (expected) generalization gap $\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]]$. For non-ERM algorithms, $\mbox{CV}_{loo}$ stability by itself is not sufficient to control this term, and further conditions are needed (Mukherjee et al., 2006), since

$$\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]]=\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S_{i}}]]+\mathbb{E}_{S}[I_{S}[f_{S_{i}}]-I_{S}[f_{S}]].$$

The second term becomes, for all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I_{S}[f_{S_{i}}]-I_{S}[f_{S}]]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]=\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]$$

and hence is controlled by $\mbox{CV}_{loo}$ stability. The first term is called the expected leave one out error in Mukherjee et al. (2006) and is controlled in ERM as $n\to\infty$, see Theorem 18 above.

Appendix B $\mbox{CV}_{loo}$ Stability of Linear Regression

We have a dataset $S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ and we want to find a mapping $\mathbf{w}\in\mathbb{R}^{d}$ that minimizes the empirical least squares risk. All interpolating solutions are of the form $\hat{\mathbf{w}}_{S}^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}^{\top}(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})$. Similarly, all interpolating solutions on the leave one out dataset $S_{i}$ can be written as $\hat{\mathbf{w}}_{S_{i}}^{\top}=\mathbf{y}_{i}^{\top}({\mathbf{X}}_{i})^{\dagger}+\mathbf{v}_{i}^{\top}(\mathbf{I}-{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger})$. Here ${\mathbf{X}},{\mathbf{X}}_{i}\in\mathbb{R}^{d\times n}$ are the data matrices for the original and leave one out datasets respectively. We note that when $\mathbf{v}=\mathbf{0}$ and $\mathbf{v}_{i}=\mathbf{0}$, we obtain the minimum norm interpolating solutions on the datasets $S$ and $S_{i}$.

In this section we want to estimate the $\mbox{CV}_{loo}$ stability of the minimum norm solution to the ERM problem in the linear regression case. This is the case outlined in Remark 9 of the main paper. In order to prove Remark 9, we only need to combine Lemma 10 with the linear regression analogue of Lemma 11. We state and prove that result in this section. This result predicts a double descent curve for the norm of the pseudoinverse, as found in practice, see Figure 3.

Lemma 21

Let 𝐰^S,𝐰^Si\hat{\mathbf{w}}_{S},\hat{\mathbf{w}}_{S_{i}} be any interpolating least squares solutions on the full and leave one out datasets S,SiS,S_{i}, then 𝐰^S𝐰^SiBCV(𝐗,𝐲,𝐯,𝐯i)\left\lVert{\hat{\mathbf{w}}_{S}-\hat{\mathbf{w}}_{S_{i}}}\right\rVert\leq B_{CV}({\mathbf{X}}^{\dagger},\mathbf{y},\mathbf{v},\mathbf{v}_{i}), and BCVB_{CV} is minimized when 𝐯=𝐯i=𝟎\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}, which corresponds to the minimum norm interpolating solutions 𝐰S,𝐰Si\mathbf{w}_{S}^{\dagger},\mathbf{w}_{S_{i}}^{\dagger}.

Also,

𝐰S𝐰Si𝐗op𝐲\left\lVert{\mathbf{w}_{S}^{\dagger}-\mathbf{w}_{S_{i}}^{\dagger}}\right\rVert\leq\left\lVert{{\mathbf{X}}^{\dagger}}\right\rVert_{op}\left\lVert{\mathbf{y}}\right\rVert (24)

As mentioned before in section 2.1 of the main paper, linear regression can be viewed as a case of the kernel regression problem where =d{\cal H}=\operatorname{\mathbb{R}}^{d}, and the feature map Φ\Phi is the identity map. The inner product and norms considered in this case are also the usual Euclidean inner product and 22-norm for vectors in d\operatorname{\mathbb{R}}^{d}. The notation \left\lVert{\cdot}\right\rVert denotes the Euclidean norm for vectors both in d\operatorname{\mathbb{R}}^{d} and n\operatorname{\mathbb{R}}^{n}. The usage of the norm should be clear from the context. Also, 𝐀op\left\lVert{\mathbf{A}}\right\rVert_{op} is the left operator norm for a matrix 𝐀n×d\mathbf{A}\in\operatorname{\mathbb{R}}^{n\times d}, that is 𝐀op=sup𝐲n,𝐲=1𝐲𝐀\left\lVert{\mathbf{A}}\right\rVert_{op}=\sup_{\mathbf{y}\in\operatorname{\mathbb{R}}^{n},||\mathbf{y}||=1}||\mathbf{y}^{\top}\mathbf{A}||.

We have nn samples in the training set for a linear regression problem, {(𝐱i,yi)}i=1n\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}. We collect all the samples into a single matrix/vector 𝐗=[𝐱1𝐱2𝐱n]d×n{\mathbf{X}}=[\mathbf{x}_{1}\mathbf{x}_{2}\ldots\mathbf{x}_{n}]\in\operatorname{\mathbb{R}}^{d\times n}, and 𝐲=[y1y2yn]n\mathbf{y}=[y_{1}y_{2}\ldots y_{n}]^{\top}\in\operatorname{\mathbb{R}}^{n}. Then any interpolating ERM solution 𝐰S\mathbf{w}_{S} satisfies the linear equation

𝐰S𝐗=𝐲\mathbf{w}_{S}^{\top}{\mathbf{X}}=\mathbf{y}^{\top} (25)

Any interpolating solution can be written as:

(𝐰^S)=𝐲𝐗+𝐯(𝐈𝐗𝐗).(\hat{\mathbf{w}}_{S})^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}^{\top}(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger}). (26)
Figure 3: Typical double descent of the pseudoinverse norm (y axis) of a random data matrix with entries distributed as 𝒩(0,1)\mathcal{N}(0,1): the condition number is worst when n=dn=d, and better both when n>dn>d (to the right of n=dn=d) and when n<dn<d (to the left of n=dn=d). From Poggio et al. (2019).
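The qualitative behavior in Figure 3 can be reproduced in a few lines; the sketch below (ours, with arbitrary sizes) computes 𝐗op||{\mathbf{X}}^{\dagger}||_{op} for Gaussian data matrices with fixed dd and varying nn, and the norm peaks near n=dn=d.

```python
# Minimal sketch: double descent of the pseudoinverse norm of an i.i.d. N(0,1)
# data matrix as a function of n for fixed d (sizes are arbitrary).
import numpy as np

rng = np.random.default_rng(2)
d = 100
for n in [25, 50, 75, 100, 125, 150, 200]:
    X = rng.standard_normal((d, n))
    s = np.linalg.svd(X, compute_uv=False)
    # ||X^+||_op is the reciprocal of the smallest nonzero singular value of X
    print(f"n = {n:3d}   ||X^+||_op ~ {1.0 / s[s > 1e-12].min():.2f}")
```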

If we consider the leave one out training set SiS_{i}, any interpolating ERM solution for 𝐗i=[𝐱1𝟎𝐱n]{\mathbf{X}}_{i}=[\mathbf{x}_{1}\ldots\mathbf{0}\ldots\mathbf{x}_{n}] and 𝐲i=[y10yn]\mathbf{y}_{i}=[y_{1}\ldots 0\ldots y_{n}]^{\top} can be written as

(𝐰^Si)=𝐲i(𝐗i)+𝐯i(𝐈𝐗i(𝐗i)).(\hat{\mathbf{w}}_{S_{i}})^{\top}=\mathbf{y}_{i}^{\top}({\mathbf{X}}_{i})^{\dagger}+\mathbf{v}_{i}^{\top}(\mathbf{I}-{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}). (27)

We can write 𝐗i{\mathbf{X}}_{i} as:

𝐗i=𝐗+𝐚𝐛{\mathbf{X}}_{i}={\mathbf{X}}+\mathbf{a}\mathbf{b}^{\top} (28)

where 𝐚d\mathbf{a}\in\operatorname{\mathbb{R}}^{d} is a column vector representing the additive change to the ithi^{\textrm{th}} column, i.e., 𝐚=𝐱i\mathbf{a}=-\mathbf{x}_{i}, and 𝐛n×1\mathbf{b}\in\operatorname{\mathbb{R}}^{n\times 1} is the ii-th element of the canonical basis in n\operatorname{\mathbb{R}}^{n} (all coefficients are zero except the ii-th, which is one). Thus 𝐚𝐛\mathbf{a}\mathbf{b}^{\top} is a d×nd\times n matrix composed of all zeros apart from the ithi^{\textrm{th}} column, which is equal to 𝐚\mathbf{a}.

We also have 𝐲i=𝐲yi𝐛\mathbf{y}_{i}=\mathbf{y}-y_{i}\mathbf{b}. Now, following Lemma 10, we are interested in bounding the quantity 𝐰^Si𝐰^S=(𝐰^Si)(𝐰^S)||\hat{\mathbf{w}}_{S_{i}}-\hat{\mathbf{w}}_{S}||=||(\hat{\mathbf{w}}_{S_{i}})^{\top}-(\hat{\mathbf{w}}_{S})^{\top}||. This can be rewritten as:

𝐰^Si𝐰^S=𝐲i(𝐗i)𝐲𝐗+𝐯i𝐯+𝐯𝐗𝐗𝐯i𝐗i(𝐗i)=(𝐲yi𝐛)(𝐗i)𝐲𝐗+𝐯i𝐯+𝐯𝐗𝐗𝐯i𝐗i(𝐗i)=𝐲((𝐗i)𝐗)+yi𝐛(𝐗i)+𝐯i𝐯+𝐯𝐗𝐗𝐯i𝐗i(𝐗i)=𝐲((𝐗i)𝐗)+𝐯i𝐯+𝐯𝐗𝐗𝐯i𝐗i(𝐗i)=𝐲((𝐗i)𝐗)+(𝐯i𝐯)(𝐈𝐗𝐗)𝐯i(𝐗𝐗𝐗i(𝐗i))\begin{split}||\hat{\mathbf{w}}_{S_{i}}-\hat{\mathbf{w}}_{S}||&=||\mathbf{y}_{i}^{\top}({\mathbf{X}}_{i})^{\dagger}-\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top}+\mathbf{v}^{\top}{\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{v}_{i}^{\top}{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}||\\ &=||(\mathbf{y}^{\top}-y_{i}\mathbf{b}^{\top})({\mathbf{X}}_{i})^{\dagger}-\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top}+\mathbf{v}^{\top}{\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{v}_{i}^{\top}{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}||\\ &=||\mathbf{y}^{\top}(({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger})+y_{i}\mathbf{b}^{\top}({\mathbf{X}}_{i})^{\dagger}+\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top}+\mathbf{v}^{\top}{\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{v}_{i}^{\top}{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}||\\ &=||\mathbf{y}^{\top}(({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger})+\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top}+\mathbf{v}^{\top}{\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{v}_{i}^{\top}{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}||\\ &=||\mathbf{y}^{\top}(({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger})+(\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top})(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})-\mathbf{v}_{i}^{\top}({\mathbf{X}}{\mathbf{X}}^{\dagger}-{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger})||\end{split} (29)

In the above equation we make use of the fact that 𝐛(𝐗i)=𝟎\mathbf{b}^{\top}({\mathbf{X}}_{i})^{\dagger}=\mathbf{0}, which holds because the ii-th column of 𝐗i{\mathbf{X}}_{i} is zero, and hence so is the ii-th row of its pseudoinverse. To compute (𝐗i)({\mathbf{X}}_{i})^{\dagger} from 𝐗{\mathbf{X}}^{\dagger} we use a classical rank-one update formula for pseudoinverses of perturbed matrices (Meyer, 1973; Baksalary et al., 2003). We see that 𝐚=𝐱i\mathbf{a}=-\mathbf{x}_{i} is a vector in the column space of 𝐗{\mathbf{X}} and 𝐛\mathbf{b} is in the range space of 𝐗T{\mathbf{X}}^{T} (provided 𝐗{\mathbf{X}} has full column rank), with β=1+𝐛𝐗𝐚=1𝐛𝐗𝐱i=0\beta=1+\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}\mathbf{a}=1-\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}\mathbf{x}_{i}=0. This means we can use Theorem 6 in Meyer (1973) (equivalent to formula 2.1 in Baksalary et al. (2003)) to obtain the following expression for (𝐗i)({\mathbf{X}}_{i})^{\dagger}:

(𝐗i)=𝐗𝐤𝐤𝐗𝐗𝐡𝐡+(𝐤𝐗𝐡)𝐤𝐡({\mathbf{X}}_{i})^{\dagger}={\mathbf{X}}^{\dagger}-\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}-{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}+(\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger})\mathbf{k}\mathbf{h} (30)

where 𝐤=𝐗𝐚\mathbf{k}={\mathbf{X}}^{\dagger}\mathbf{a}, and 𝐡=𝐛𝐗\mathbf{h}=\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}, and 𝐮=𝐮𝐮2\mathbf{u}^{\dagger}=\frac{\mathbf{u}^{\top}}{||\mathbf{u}||^{2}} for any non-zero vector 𝐮\mathbf{u}.
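As a numerical sanity check (ours, not part of the original argument), the sketch below verifies equation 30 against a direct pseudoinverse computation, for an arbitrary Gaussian data matrix with d>nd>n so that the conditions above (full column rank and β=0\beta=0) hold.

```python
# Minimal check of the rank-one pseudoinverse update of equation 30
# (Meyer, 1973, Theorem 6) in the full-column-rank regime d > n.
import numpy as np

rng = np.random.default_rng(3)
d, n, i = 30, 20, 7
X = rng.standard_normal((d, n))                  # full column rank (generically)
Xp = np.linalg.pinv(X)                           # X^+, shape n x d

a = -X[:, i]                                     # a = -x_i
b = np.zeros(n); b[i] = 1.0                      # i-th canonical basis vector
Xi = X + np.outer(a, b)                          # X with its i-th column zeroed

k = Xp @ a                                       # k = X^+ a  (equals -b here)
h = b @ Xp                                       # h = b^T X^+
k_dag = k / np.dot(k, k)                         # u^+ = u^T / ||u||^2
h_dag = h / np.dot(h, h)

Xi_pinv = (Xp
           - np.outer(k, k_dag @ Xp)             # k k^+ X^+
           - np.outer(Xp @ h_dag, h)             # X^+ h^+ h
           + (k_dag @ Xp @ h_dag) * np.outer(k, h))  # (k^+ X^+ h^+) k h
print(np.allclose(Xi_pinv, np.linalg.pinv(Xi)))  # should print True
```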

(𝐗i)𝐗=(𝐤𝐗𝐡)𝐤𝐡𝐤𝐤𝐗𝐗𝐡𝐡=(𝐛𝐗𝐡)𝐛𝐡𝐛𝐡𝐗𝐡𝐡(𝐗i)𝐗op=𝐗𝐡𝐡op𝐗op\begin{split}({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger}&=(\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger})\mathbf{k}\mathbf{h}-\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}-{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}\\ &=(\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger})\mathbf{b}\mathbf{h}-\mathbf{b}\mathbf{h}-{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}\\ \implies||({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger}||_{op}&=||{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}||_{op}\\ &\leq||{\mathbf{X}}^{\dagger}||_{op}\end{split} (31)

In the above computation we use the fact that 𝐤=𝐛\mathbf{k}=-\mathbf{b}, so that 𝐤𝐤𝐗=𝐛𝐡\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}=\mathbf{b}\mathbf{h}, and that 𝐛𝐗𝐡=𝐡𝐡=1\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}=\mathbf{h}\mathbf{h}^{\dagger}=1, so the first two terms on the second line cancel. The final inequality follows from the fact that the operator norm of a rank 1 matrix is given by 𝐮𝐯op=𝐮×𝐯||\mathbf{u}\mathbf{v}^{\top}||_{op}=||\mathbf{u}||\times||\mathbf{v}||, which gives 𝐡𝐡op=1||\mathbf{h}^{\dagger}\mathbf{h}||_{op}=1.

Also, from List 2 of Baksalary et al. (2003) we have that 𝐗i(𝐗i)=𝐗𝐗𝐡𝐡{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}={\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{h}^{\dagger}\mathbf{h}.
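This identity can also be checked numerically; the short sketch below (ours, with arbitrary sizes) compares both sides directly.

```python
# Quick check of the identity X_i (X_i)^+ = X X^+ - h^+ h (Baksalary et al., 2003)
# in the full-column-rank regime d > n.
import numpy as np

rng = np.random.default_rng(5)
d, n, i = 30, 20, 3
X = rng.standard_normal((d, n))
Xi = X.copy(); Xi[:, i] = 0.0

Xp = np.linalg.pinv(X)
h = Xp[i]                                   # h = b^T X^+ is the i-th row of X^+
h_dag_h = np.outer(h, h) / np.dot(h, h)     # the rank-one projector h^+ h
print(np.allclose(Xi @ np.linalg.pinv(Xi), X @ Xp - h_dag_h))
```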

Plugging these calculations into equation 29, we get:

𝐰^Si𝐰^S=𝐲((𝐗i)𝐗)+(𝐯i𝐯)(𝐈𝐗𝐗)𝐯i(𝐗𝐗𝐗i(𝐗i))B0+𝐈𝐗𝐗op𝐯𝐯i+𝐯i×𝐡𝐡opB0+2𝐯𝐯i+𝐯i\begin{split}||\hat{\mathbf{w}}_{S_{i}}-\hat{\mathbf{w}}_{S}||&=||\mathbf{y}^{\top}(({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger})+(\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top})(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})-\mathbf{v}_{i}^{\top}({\mathbf{X}}{\mathbf{X}}^{\dagger}-{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger})||\\ &\leq B_{0}+||\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger}||_{op}||\mathbf{v}-\mathbf{v}_{i}||+||\mathbf{v}_{i}||\times||\mathbf{h}^{\dagger}\mathbf{h}||_{op}\\ &\leq B_{0}+2||\mathbf{v}-\mathbf{v}_{i}||+||\mathbf{v}_{i}||\end{split} (32)

We see that the right hand side is minimized when 𝐯=𝐯i=𝟎\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}. Moreover, by equation 31 and the definition of the operator norm, the first term is bounded by B0=𝐗op𝐲B_{0}=||{\mathbf{X}}^{\dagger}||_{op}||\mathbf{y}||. This concludes the proof of Lemma 21.
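To illustrate Lemma 21 numerically (an illustration of ours, not part of the proof), the sketch below compares the leave one out perturbation of the minimum norm interpolant with that of generic interpolants built from random 𝐯,𝐯i\mathbf{v},\mathbf{v}_{i}, and checks the bound of equation 24; the data model and sizes are arbitrary.

```python
# Minimal sketch: leave-one-out perturbation of interpolating solutions, and
# the bound ||w_S - w_Si|| <= ||X^+||_op ||y|| of equation 24.
import numpy as np

rng = np.random.default_rng(4)
d, n, i = 60, 40, 5
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)
Xi = X.copy(); Xi[:, i] = 0.0               # leave-one-out data matrix
yi = y.copy(); yi[i] = 0.0

Xp, Xip = np.linalg.pinv(X), np.linalg.pinv(Xi)

def interpolant(Xp_, X_, y_, v):
    # w^T = y^T X^+ + v^T (I - X X^+)
    return y_ @ Xp_ + v @ (np.eye(d) - X_ @ Xp_)

w_S   = interpolant(Xp,  X,  y,  np.zeros(d))      # minimum norm on S
w_Si  = interpolant(Xip, Xi, yi, np.zeros(d))      # minimum norm on S_i
v, vi = rng.standard_normal(d), rng.standard_normal(d)
w_Sv  = interpolant(Xp,  X,  y,  v)                # generic interpolants
w_Siv = interpolant(Xip, Xi, yi, vi)

bound = np.linalg.norm(Xp, 2) * np.linalg.norm(y)  # ||X^+||_op ||y||
print("min-norm perturbation :", np.linalg.norm(w_S - w_Si))
print("bound ||X^+||_op ||y|| :", bound)
print("generic perturbation  :", np.linalg.norm(w_Sv - w_Siv))  # typically larger
```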