
For interpolating kernel machines, minimizing the norm of the ERM solution minimizes stability

Akshay Rangamani [email protected] Center for Brains, Minds, and Machines, MIT McGovern Institute for Brain Research, MIT Lorenzo Rosasco [email protected] Center for Brains, Minds, and Machines, MIT Tomaso Poggio [email protected] Center for Brains, Minds, and Machines, MIT McGovern Institute for Brain Research, MIT
Abstract

We study the average $\mbox{CV}_{loo}$ stability of kernel ridge-less regression and derive corresponding risk bounds. We show that the interpolating solution with minimum norm minimizes a bound on $\mbox{CV}_{loo}$ stability, which in turn is controlled by the condition number of the empirical kernel matrix. The latter can be characterized in the asymptotic regime where both the dimension and cardinality of the data go to infinity. Under the assumption of random kernel matrices, the corresponding test error should be expected to follow a double descent curve.

1 Introduction

Statistical learning theory studies the learning properties of machine learning algorithms, and more fundamentally, the conditions under which learning from finite data is possible. In this context, classical learning theory focuses on the size of the hypothesis space in terms of different complexity measures, such as combinatorial dimensions, covering numbers and Rademacher/Gaussian complexities (Shalev-Shwartz & Ben-David, 2014; Boucheron et al., 2005). Another more recent approach is based on defining suitable notions of stability with respect to perturbation of the data (Bousquet & Elisseeff, 2001; Kutin & Niyogi, 2002). In this view, the continuity of the process that maps data to estimators is crucial, rather than the complexity of the hypothesis space. Different notions of stability can be considered, depending on the data perturbation and metric considered (Kutin & Niyogi, 2002). Interestingly, the stability and complexity approaches to characterizing the learnability of problems are not at odds with each other, and can be shown to be equivalent (Poggio et al., 2004; Shalev-Shwartz et al., 2010).

In modern machine learning, overparameterized models with more parameters than training samples have become common. The ability of these models to generalize is well explained by classical statistical learning theory as long as some form of regularization is used in the training process (Bühlmann & Van De Geer, 2011; Steinwart & Christmann, 2008). However, it was recently shown - first for deep networks (Zhang et al., 2017), and more recently for kernel methods (Belkin et al., 2019) - that learning is possible in the absence of regularization, i.e., when perfectly fitting/interpolating the data. Much recent work in statistical learning theory has tried to find theoretical grounding for this empirical finding. Since learning using models that interpolate is not exclusive to deep neural networks, we study generalization in the presence of interpolation in the case of kernel methods. We study both linear and kernel least squares problems in this paper.

Our Contributions:

  • We characterize the generalization properties of interpolating solutions for linear and kernel least squares problems using a stability approach. While the (uniform) stability properties of regularized kernel methods are well known (Bousquet & Elisseeff, 2001), we study interpolating solutions of the unregularized ("ridgeless") regression problems.

  • We obtain an upper bound on the stability of interpolating solutions, and show that this upper bound is minimized by the minimum norm interpolating solution. This also means that among all interpolating solutions, the minimum norm solution has the best test error. In particular, the same conclusion is also true for gradient descent, since it converges to the minimal norm solution in the setting we consider, see e.g. Rosasco & Villa (2015).

  • Our stability bounds show that the average stability of the minimum norm solution is controlled by the condition number of the empirical kernel matrix. It is well known that the numerical stability of the least squares solution is governed by the condition number of the associated kernel matrix (see Poggio et al. (2019)). Our results show that the condition number also controls stability (and hence, test error) in a statistical sense.

Organization:

In section 2, we introduce basic ideas in statistical learning and empirical risk minimization, as well as the notation used in the rest of the paper. In section 3, we briefly recall some definitions of stability. In section 4, we study the stability of interpolating solutions to kernel least squares and show that the minimum norm solutions minimize an upper bound on the stability. In section 5 we discuss our results in the context of recent work on high dimensional regression. We conclude in section 6.

2 Statistical Learning and Empirical Risk Minimization

We begin by recalling the basic ideas in statistical learning theory. In this setting, $X$ is the space of features, $Y$ is the space of targets or labels, and there is an unknown probability distribution $\mu$ on the product space $Z=X\times Y$. In the following, we consider $X=\mathbb{R}^{d}$ and $Y=\mathbb{R}$. The distribution $\mu$ is fixed but unknown, and we are given a training set $S$ consisting of $n$ samples (thus $|S|=n$) drawn i.i.d. from the probability distribution on $Z^{n}$, $S=(z_{i})_{i=1}^{n}=(\mathbf{x}_{i},y_{i})_{i=1}^{n}$. Intuitively, the goal of supervised learning is to use the training set $S$ to "learn" a function $f_{S}$ that, evaluated at a new value $\mathbf{x}_{new}$, should predict the associated value $y_{new}$, i.e. $y_{new}\approx f_{S}(\mathbf{x}_{new})$.

The loss is a function $V:\mathcal{F}\times Z\to[0,\infty)$, where $\mathcal{F}$ is the space of measurable functions from $X$ to $Y$, that measures how well a function performs on a data point. We define a hypothesis space $\mathcal{H}\subseteq\mathcal{F}$ where algorithms search for solutions. With the above notation, the expected risk of $f$ is defined as $I[f]=\mathbb{E}_{z}V(f,z)$, which is the expected loss on a new sample drawn according to the data distribution $\mu$. In this setting, statistical learning can be seen as the problem of finding an approximate minimizer of the expected risk given a training set $S$. A classical approach to derive an approximate solution is empirical risk minimization (ERM), where we minimize the empirical risk $I_{S}[f]=\frac{1}{n}\sum_{i=1}^{n}V(f,z_{i})$.

A natural error measure for our ERM solution $f_{S}$ is the expected excess risk $\mathbb{E}_{S}[I[f_{S}]-\min_{f\in\mathcal{H}}I[f]]$. Another common error measure is the expected generalization error/gap given by $\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]]$. These two error measures are closely related since the expected excess risk is easily bounded by the expected generalization error (see Lemma 5).

2.1 Kernel Least Squares and Minimal Norm Solution

The focus in this paper is on the kernel least squares problem. We assume the loss function $V$ is the square loss, that is, $V(f,z)=(y-f(\mathbf{x}))^{2}$. The hypothesis space is assumed to be a reproducing kernel Hilbert space, defined by a positive definite kernel $K:X\times X\to\mathbb{R}$ or an associated feature map $\Phi:X\to\mathcal{H}$, such that $K(\mathbf{x},\mathbf{x}^{\prime})=\langle\Phi(\mathbf{x}),\Phi(\mathbf{x}^{\prime})\rangle_{\mathcal{H}}$ for all $\mathbf{x},\mathbf{x}^{\prime}\in X$, where $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ is the inner product in $\mathcal{H}$. In this setting, functions are linearly parameterized, that is, there exists $w\in\mathcal{H}$ such that $f(\mathbf{x})=\langle w,\Phi(\mathbf{x})\rangle_{\mathcal{H}}$ for all $\mathbf{x}\in X$.

The ERM problem typically has multiple solutions, one of which is the minimal norm solution:

$$f_{S}^{\dagger}=\operatorname*{arg\,min}_{f\in\mathcal{M}}\left\lVert f\right\rVert_{\mathcal{H}},\qquad\mathcal{M}=\operatorname*{arg\,min}_{f\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^{n}(f(\mathbf{x}_{i})-y_{i})^{2}. \tag{1}$$

Here $\left\lVert\cdot\right\rVert_{\mathcal{H}}$ is the norm on $\mathcal{H}$ induced by the inner product. The minimal norm solution can be shown to be unique and satisfy a representer theorem, that is, for all $\mathbf{x}\in X$:

$$f_{S}^{\dagger}(\mathbf{x})=\sum_{i=1}^{n}K(\mathbf{x},\mathbf{x}_{i})c_{S,i},\qquad\mathbf{c}_{S}={\mathbf{K}}^{\dagger}{\mathbf{y}} \tag{2}$$

where $\mathbf{c}_{S}=(c_{S,1},\dots,c_{S,n})$, $\mathbf{y}=(y_{1},\ldots,y_{n})\in\mathbb{R}^{n}$, ${\mathbf{K}}$ is the $n$ by $n$ matrix with entries ${\mathbf{K}}_{ij}=K(\mathbf{x}_{i},\mathbf{x}_{j})$, $i,j=1,\dots,n$, and ${\mathbf{K}}^{\dagger}$ is the Moore-Penrose pseudoinverse of ${\mathbf{K}}$. If we assume $n\leq d$ and that we have $n$ linearly independent data features, that is, the rank of ${\mathbf{X}}$ is $n$, then it is possible to show that for many kernels one can replace ${\mathbf{K}}^{\dagger}$ by ${\mathbf{K}}^{-1}$ (see Remark 2). Note that invertibility is necessary and sufficient for interpolation. That is, if ${\mathbf{K}}$ is invertible, $f_{S}^{\dagger}(\mathbf{x}_{i})=y_{i}$ for all $i=1,\dots,n$, in which case the training error in (1) is zero.
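As an illustration, here is a minimal numpy sketch of equation (2): it builds the empirical kernel matrix for a Gaussian (RBF) kernel on synthetic data, computes the minimum norm coefficients $\mathbf{c}_{S}={\mathbf{K}}^{\dagger}\mathbf{y}$, and checks interpolation. The kernel choice, bandwidth, and synthetic data are illustrative assumptions, not part of the paper's setup.

```python
import numpy as np

def rbf_kernel(A, B, sigma=5.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d = 20, 50                       # n <= d
X = rng.standard_normal((n, d))     # rows are the training points x_1, ..., x_n
y = rng.standard_normal(n)

K = rbf_kernel(X, X)                # empirical kernel matrix
c_S = np.linalg.pinv(K) @ y         # c_S = K^† y, the minimum norm coefficients

# Interpolation check: f_S^†(x_i) = sum_j c_{S,j} K(x_i, x_j) equals y_i
print(np.allclose(K @ c_S, y))      # True whenever K is invertible (Remark 2)

def f_min_norm(x_new):
    # Evaluate the minimum norm interpolant at a new point x_new of shape (d,).
    return (rbf_kernel(x_new[None, :], X) @ c_S)[0]
```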

Remark 1 (Pseudoinverse for underdetermined linear systems)

A simple yet relevant example is that of linear functions $f(\mathbf{x})=\mathbf{w}^{\top}\mathbf{x}$, which correspond to $\mathcal{H}=\mathbb{R}^{d}$ and $\Phi$ the identity map. If the rank of ${\mathbf{X}}\in\mathbb{R}^{d\times n}$ is $n$, then any interpolating solution $\mathbf{w}_{S}$ satisfies $\mathbf{w}_{S}^{\top}\mathbf{x}_{i}=y_{i}$ for all $i=1,\dots,n$, and the minimal norm solution, also called the Moore-Penrose solution, is given by $(\mathbf{w}_{S}^{\dagger})^{\top}={\mathbf{y}}^{\top}{\mathbf{X}}^{\dagger}$ where, since ${\mathbf{X}}$ has full column rank, the pseudoinverse takes the form ${\mathbf{X}}^{\dagger}=({\mathbf{X}}^{\top}{\mathbf{X}})^{-1}{\mathbf{X}}^{\top}$.
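A quick numerical sanity check of Remark 1, under the assumption of Gaussian synthetic data: the Moore-Penrose solution interpolates, and the pseudoinverse coincides with the full-column-rank formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 30, 10
X = rng.standard_normal((d, n))     # columns are the n data points; rank n a.s.
y = rng.standard_normal(n)

X_pinv = np.linalg.pinv(X)                                    # n x d
w_dagger = y @ X_pinv                                         # (w_S^†)^T = y^T X^†

print(np.allclose(X_pinv, np.linalg.inv(X.T @ X) @ X.T))      # (X^T X)^{-1} X^T
print(np.allclose(w_dagger @ X, y))                           # interpolation: w^T x_i = y_i
```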

Remark 2 (Invertibility of translation invariant kernels)

Translation invariant kernels are a family of kernel functions given by $K(\mathbf{x}_{1},\mathbf{x}_{2})=k(\mathbf{x}_{1}-\mathbf{x}_{2})$ where $k$ is an even function on $\mathbb{R}^{d}$. Translation invariant kernels are Mercer kernels (positive semidefinite) if the Fourier transform of $k(\cdot)$ is non-negative. For Radial Basis Function kernels ($K(\mathbf{x}_{1},\mathbf{x}_{2})=k(||\mathbf{x}_{1}-\mathbf{x}_{2}||)$) we have the additional property, due to Theorem 2.3 of Micchelli (1986), that for distinct points $\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}\in\mathbb{R}^{d}$ the kernel matrix ${\mathbf{K}}$ is non-singular and thus invertible.

The above discussion is directly related to regularization approaches.

Remark 3 (Stability and Tikhonov regularization)

Tikhonov regularization is used to prevent potential unstable behaviors. In the above setting, it corresponds to replacing Problem (1) by $\min_{f\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^{n}(f(\mathbf{x}_{i})-y_{i})^{2}+\lambda\left\lVert f\right\rVert^{2}_{\mathcal{H}}$, where the corresponding unique solution is given by $f_{S}^{\lambda}(\mathbf{x})=\sum_{i=1}^{n}K(\mathbf{x},\mathbf{x}_{i})c_{i}$, $\mathbf{c}=({\mathbf{K}}+\lambda\mathbf{I}_{n})^{-1}{\mathbf{y}}$. In contrast to ERM solutions, the above approach prevents interpolation. The properties of the corresponding estimator are well known. In this paper, we complement these results focusing on the case $\lambda\to 0$.
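The limit $\lambda\to 0$ can be seen numerically: below is a small sketch (with an assumed RBF kernel on synthetic data) showing that the Tikhonov coefficients $({\mathbf{K}}+\lambda\mathbf{I}_{n})^{-1}\mathbf{y}$ approach the minimal norm coefficients ${\mathbf{K}}^{\dagger}\mathbf{y}$ as $\lambda$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
X = rng.standard_normal((n, 5))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)                      # RBF kernel matrix on distinct points
y = rng.standard_normal(n)

c_dagger = np.linalg.pinv(K) @ y           # minimal norm coefficients K^† y
for lam in [1.0, 1e-2, 1e-4, 1e-8]:
    c_lam = np.linalg.solve(K + lam * np.eye(n), y)   # (K + λI)^{-1} y
    print(f"lambda={lam:g}   ||c_lam - c_dagger|| = {np.linalg.norm(c_lam - c_dagger):.3e}")
```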

Finally, we end by recalling the connection between the minimal norm solution and gradient descent.

Remark 4 (Minimum norm and gradient descent)

In our setting, it is well known that both batch and stochastic gradient iterations converge exactly to the minimal norm solution when multiple solutions exist, see e.g. Rosasco & Villa (2015). Thus, a study of the properties of minimal norm solutions explains the properties of the solution to which gradient descent converges. In particular, when ERM has multiple interpolating solutions, gradient descent converges to a solution that minimizes a bound on stability, as we show next.

3 Error Bounds via Stability

In this section, we recall basic results relating the learning and stability properties of Empirical Risk Minimization (ERM). Throughout the paper, we assume that ERM achieves a minimum, albeit the extension to almost minimizers is possible (Mukherjee et al., 2006) and important for exponential-type loss functions (Poggio, 2020). We do not assume the expected risk to achieve a minimum. Since we will be considering leave-one-out stability in this section, we look at solutions to ERM over the complete training set $S=\{z_{1},z_{2},\dots,z_{n}\}$ and the leave one out training set $S_{i}=\{z_{1},z_{2},\dots,z_{i-1},z_{i+1},\dots,z_{n}\}$.

The excess risk of ERM can be easily related to its stability properties. Here, we follow the definition laid out in Mukherjee et al. (2006) and say that an algorithm is cross-validation leave-one-out ($\mbox{CV}_{loo}$) stable in expectation if there exists $\beta_{CV}>0$ such that for all $i=1,\dots,n$,

$$\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]\leq\beta_{CV}. \tag{3}$$

This definition is justified by the following result, which bounds the excess risk of a learning algorithm by its average $\mbox{CV}_{loo}$ stability.

Lemma 5 (Excess Risk & $\mbox{CV}_{loo}$ Stability)

For all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I[f_{S_{i}}]-\inf_{f\in\mathcal{H}}I[f]]\leq\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]. \tag{4}$$
Remark 6 (Connection to uniform stability and other notions of stability)

Uniform stability, introduced by Bousquet & Elisseeff (2001), corresponds in our notation to the assumption that there exists $\beta_{u}>0$ such that for all $i=1,\dots,n$, $\sup_{z\in Z}|V(f_{S_{i}},z)-V(f_{S},z)|\leq\beta_{u}$. Clearly this is a strong notion implying most other definitions of stability. We note that there are a number of different notions of stability. We refer the interested reader to Kutin & Niyogi (2002) and Mukherjee et al. (2006).

We present the proof of Lemma 5 in Appendix A.2 due to lack of space. In Appendix A, we also discuss other definitions of stability and their connections to concepts in statistical learning theory like generalization and learnability.

4 $\mbox{CV}_{loo}$ Stability of Kernel Least Squares

In this section we analyze the expected $\mbox{CV}_{loo}$ stability of interpolating solutions to the kernel least squares problem, and obtain an upper bound on their stability. We show that this upper bound on the expected $\mbox{CV}_{loo}$ stability is smallest for the minimal norm interpolating solution (1) when compared to other interpolating solutions to the kernel least squares problem.

We have a dataset $S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ and we want to find a mapping $f\in\mathcal{H}$ that minimizes the empirical least squares risk. Here $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) defined by a positive definite kernel $K:X\times X\to\mathbb{R}$. All interpolating solutions are of the form $\hat{f}_{S}(\cdot)=\sum_{j=1}^{n}\hat{c}_{S,j}K(\mathbf{x}_{j},\cdot)$, where $\hat{\mathbf{c}}_{S}={\mathbf{K}}^{\dagger}\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})\mathbf{v}$. Similarly, all interpolating solutions on the leave one out dataset $S_{i}$ can be written as $\hat{f}_{S_{i}}(\cdot)=\sum_{j=1,j\neq i}^{n}\hat{c}_{S_{i},j}K(\mathbf{x}_{j},\cdot)$, where $\hat{\mathbf{c}}_{S_{i}}={\mathbf{K}}_{S_{i}}^{\dagger}\mathbf{y}_{i}+(\mathbf{I}-{\mathbf{K}}_{S_{i}}^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i}$. Here ${\mathbf{K}},{\mathbf{K}}_{S_{i}}$ are the empirical kernel matrices on the original and leave one out datasets respectively. We note that when $\mathbf{v}=\mathbf{0}$ and $\mathbf{v}_{i}=\mathbf{0}$, we obtain the minimum norm interpolating solutions on the datasets $S$ and $S_{i}$.
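To make the quantity studied in Theorem 7 below concrete, the following sketch empirically estimates the $\mbox{CV}_{loo}$ gap $\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]$ for the minimum norm kernel interpolant by looping over left-out points. The RBF kernel, bandwidth, and synthetic data model are assumptions made only for illustration; for simplicity the leave-one-out solution is computed by dropping the $i$-th point rather than zeroing the $i$-th row and column of ${\mathbf{K}}$, which gives the same interpolant.

```python
import numpy as np

def rbf(A, B, sigma=5.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d = 40, 60
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)   # synthetic targets

K = rbf(X, X)
c_S = np.linalg.pinv(K) @ y                        # minimum norm interpolant on S

gaps = []
for i in range(n):
    keep = np.delete(np.arange(n), i)
    c_Si = np.linalg.pinv(rbf(X[keep], X[keep])) @ y[keep]   # interpolant on S_i
    f_Si_at_i = (rbf(X[i:i+1], X[keep]) @ c_Si)[0]           # f_{S_i}(x_i)
    f_S_at_i = (rbf(X[i:i+1], X) @ c_S)[0]                   # f_S(x_i), equals y_i
    gaps.append((y[i] - f_Si_at_i) ** 2 - (y[i] - f_S_at_i) ** 2)

print("average CV_loo gap:", np.mean(gaps))        # the quantity bounded in Theorem 7
```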

Theorem 7 (Main Theorem)

Consider the kernel least squares problem with a bounded kernel and bounded outputs $y$, that is, there exist $\kappa,M>0$ such that

$$K(\mathbf{x},\mathbf{x}^{\prime})\leq\kappa,\qquad|y|\leq M, \tag{5}$$

almost surely. Then for any interpolating solutions $\hat{f}_{S_{i}},\hat{f}_{S}$,

$$\mathbb{E}_{S}[V(\hat{f}_{S_{i}},z_{i})-V(\hat{f}_{S},z_{i})]\leq\beta_{CV}({\mathbf{K}}^{\dagger},\mathbf{y},\mathbf{v},\mathbf{v}_{i}) \tag{6}$$

This bound $\beta_{CV}$ is minimized when $\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}$, which corresponds to the minimum norm interpolating solutions $f_{S}^{\dagger},f_{S_{i}}^{\dagger}$. For the minimum norm solutions we have $\beta_{CV}=C_{1}\beta_{1}+C_{2}\beta_{2}$, where $\beta_{1}=\mathbb{E}_{S}\left[||{\mathbf{K}}^{\frac{1}{2}}||_{op}\,||{\mathbf{K}}^{\dagger}||_{op}\times\textrm{cond}({\mathbf{K}})\times||\mathbf{y}||\right]$ and $\beta_{2}=\mathbb{E}_{S}\left[||{\mathbf{K}}^{\frac{1}{2}}||_{op}^{2}\,||{\mathbf{K}}^{\dagger}||_{op}^{2}\times(\textrm{cond}({\mathbf{K}}))^{2}\times||\mathbf{y}||^{2}\right]$, and $C_{1},C_{2}$ are absolute constants that do not depend on either $d$ or $n$.

In the above theorem $||{\mathbf{K}}||_{op}$ refers to the operator norm of the kernel matrix ${\mathbf{K}}$, $||\mathbf{y}||$ refers to the standard $\ell_{2}$ norm for $\mathbf{y}\in\mathbb{R}^{n}$, and $\textrm{cond}({\mathbf{K}})$ is the condition number of the matrix ${\mathbf{K}}$.

We can combine the above result with Lemma 5 to obtain the following bound on excess risk for minimum norm interpolating solutions to the kernel least squares problem:

Corollary 8

The excess risk of the minimum norm interpolating kernel least squares solution can be bounded as:

$$\mathbb{E}_{S}\left[I[f_{S_{i}}^{\dagger}]-\inf_{f\in\mathcal{H}}I[f]\right]\leq C_{1}\beta_{1}+C_{2}\beta_{2}$$

where β1,β2\beta_{1},\beta_{2} are as defined previously.

Remark 9 (Underdetermined Linear Regression)

In the case of underdetermined linear regression, i.e., linear regression where the dimensionality is larger than the number of samples in the training set, we can prove a version of Theorem 7 with $\beta_{1}=\mathbb{E}_{S}\left[\left\lVert{\mathbf{X}}^{\dagger}\right\rVert_{op}\left\lVert\mathbf{y}\right\rVert\right]$ and $\beta_{2}=\mathbb{E}_{S}\left[\left\lVert{\mathbf{X}}^{\dagger}\right\rVert_{op}^{2}\left\lVert\mathbf{y}\right\rVert^{2}\right]$. Due to space constraints, we present the proof of the results in the linear regression case in Appendix B.

4.1 Key lemmas

In order to prove Theorem 7 we make use of the following lemmas to bound the $\mbox{CV}_{loo}$ stability using the norms and the difference of the solutions.

Lemma 10

Under assumption (5), for all $i=1,\dots,n$, it holds that

$$\mathbb{E}_{S}[V(\hat{f}_{S_{i}},z_{i})-V(\hat{f}_{S},z_{i})]\leq\mathbb{E}_{S}\left[\left(2M+\kappa\left(\left\lVert\hat{f}_{S}\right\rVert_{\mathcal{H}}+\left\lVert\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}\right)\right)\times\kappa\left\lVert\hat{f}_{S}-\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}\right]$$

Proof  We begin by recalling that the square loss is locally Lipschitz, that is, for all $y,a,a^{\prime}\in\mathbb{R}$,

$$|(y-a)^{2}-(y-a^{\prime})^{2}|\leq(2|y|+|a|+|a^{\prime}|)|a-a^{\prime}|.$$

If we apply this result to $f,f^{\prime}$ in a RKHS $\mathcal{H}$,

$$|(y-f(\mathbf{x}))^{2}-(y-f^{\prime}(\mathbf{x}))^{2}|\leq\kappa(2M+\kappa\left(\left\lVert f\right\rVert_{\mathcal{H}}+\left\lVert f^{\prime}\right\rVert_{\mathcal{H}}\right))\left\lVert f-f^{\prime}\right\rVert_{\mathcal{H}},$$

using the basic property of a RKHS that for all $f\in\mathcal{H}$

$$|f(\mathbf{x})|\leq\left\lVert f\right\rVert_{\infty}\leq\kappa\left\lVert f\right\rVert_{\mathcal{H}}. \tag{7}$$

In particular, we can plug $\hat{f}_{S_{i}}$ and $\hat{f}_{S}$ into the above inequality, and the almost positivity of ERM (Mukherjee et al., 2006) allows us to drop the absolute value on the left hand side. Finally, the desired result follows by taking the expectation over $S$.  


Now that we have bounded the $\mbox{CV}_{loo}$ stability using the norms and the difference of the solutions, we can find a bound on the difference between the solutions to the kernel least squares problem. This is our main stability estimate.

Lemma 11

Let $\hat{f}_{S},\hat{f}_{S_{i}}$ be any interpolating kernel least squares solutions on the full and leave one out datasets (as defined at the top of this section). Then $\left\lVert\hat{f}_{S}-\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}\leq B_{CV}({\mathbf{K}}^{\dagger},\mathbf{y},\mathbf{v},\mathbf{v}_{i})$, and $B_{CV}$ is minimized when $\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}$, which corresponds to the minimum norm interpolating solutions $f_{S}^{\dagger},f_{S_{i}}^{\dagger}$.

Also,

$$\left\lVert f_{S}^{\dagger}-f_{S_{i}}^{\dagger}\right\rVert_{\mathcal{H}}\leq\left\lVert{\mathbf{K}}^{\frac{1}{2}}\right\rVert_{op}\left\lVert{\mathbf{K}}^{\dagger}\right\rVert_{op}\times\textrm{cond}({\mathbf{K}})\times\left\lVert\mathbf{y}\right\rVert \tag{8}$$

Since the minimum norm interpolating solutions minimize both $\left\lVert\hat{f}_{S}\right\rVert_{\mathcal{H}}+\left\lVert\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}$ and $\left\lVert\hat{f}_{S}-\hat{f}_{S_{i}}\right\rVert_{\mathcal{H}}$ (from Lemmas 10 and 11), we can put them together to prove Theorem 7. In the following section we provide the proof of Lemma 11.

Simulation:

In order to illustrate that the minimum norm interpolating solution is the best performing interpolating solution, we ran a simple experiment on a linear regression problem. We synthetically generated data from a linear model $\mathbf{y}=\mathbf{w}^{\top}{\mathbf{X}}$, where ${\mathbf{X}}\in\mathbb{R}^{d\times n}$ was i.i.d. $\mathcal{N}(0,1)$. The dimension of the data was $d=1000$ and there were $n=200$ samples in the training dataset. A held out dataset of $50$ samples was used to compute the test mean squared error (MSE). Interpolating solutions were computed as $\hat{\mathbf{w}}^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}^{\top}(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})$ and the norm of $\mathbf{v}$ was varied to obtain the plot. The results are shown in Figure 1, where we can see that the training loss is 0 for all interpolants, but test MSE increases as $||\mathbf{v}||$ increases, with $(\mathbf{w}^{\dagger})^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}$ having the best performance. The figure reports results averaged over $100$ trials; a minimal code sketch of the experiment is given after the figure.

Figure 1: Plot of train and test mean squared error (MSE) vs distance between an interpolating solution $\hat{\mathbf{w}}$ and the minimum norm interpolant $\mathbf{w}^{\dagger}$ of a linear regression problem. Data was synthetically generated as $\mathbf{y}=\mathbf{w}^{\top}{\mathbf{X}}$, where ${\mathbf{X}}\in\mathbb{R}^{d\times n}$ with i.i.d. $\mathcal{N}(0,1)$ entries and $d=1000,n=200$. Other interpolating solutions were computed as $\hat{\mathbf{w}}^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}^{\top}(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})$ and the norm of $\mathbf{v}$ was varied to obtain the plot. Train MSE is 0 for all interpolants, but test MSE increases as $||\mathbf{v}||$ increases, with $\mathbf{w}^{\dagger}$ having the best performance. This plot represents results averaged over $100$ trials.
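Below is a minimal numpy sketch of this experiment (a single trial with a few values of $||\mathbf{v}||$ rather than a full averaged sweep; the random seed and the scales of $\mathbf{v}$ are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 1000, 200, 50
w = rng.standard_normal(d)

X = rng.standard_normal((d, n))          # training inputs, one sample per column
y = w @ X                                # y = w^T X (noiseless linear model)
X_te = rng.standard_normal((d, n_test))
y_te = w @ X_te

X_pinv = np.linalg.pinv(X)               # n x d
P = np.eye(d) - X @ X_pinv               # projector onto the complement of span(X)

for scale in [0.0, 0.1, 1.0, 10.0]:
    v = scale * rng.standard_normal(d)
    w_hat = y @ X_pinv + v @ P           # interpolant; scale = 0 gives w^†
    train_mse = np.mean((w_hat @ X - y) ** 2)        # ~0 for every interpolant
    test_mse = np.mean((w_hat @ X_te - y_te) ** 2)   # grows with ||v||
    print(f"||v|| = {np.linalg.norm(v):8.2f}   train MSE = {train_mse:.2e}   test MSE = {test_mse:.3f}")
```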

4.2 Proof of Lemma 11

We can write any interpolating solution to the kernel regression problem as $\hat{f}_{S}(\mathbf{x})=\sum_{i=1}^{n}\hat{c}_{S,i}K(\mathbf{x}_{i},\mathbf{x})$, where $\hat{\mathbf{c}}_{S}={\mathbf{K}}^{\dagger}\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})\mathbf{v}$, ${\mathbf{K}}\in\mathbb{R}^{n\times n}$ is the kernel matrix of $K$ on $S$ (i.e. ${\mathbf{K}}_{ij}=K(\mathbf{x}_{i},\mathbf{x}_{j})$), $\mathbf{v}$ is any vector in $\mathbb{R}^{n}$, and $\mathbf{y}\in\mathbb{R}^{n}$ is the vector $\mathbf{y}=[y_{1}\ldots y_{n}]^{\top}$.

Similarly, the coefficient vector for the corresponding interpolating solution to the problem over the leave one out dataset $S_{i}$ is $\hat{\mathbf{c}}_{S_{i}}=({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{y}_{i}+(\mathbf{I}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i}$, where $\mathbf{y}_{i}=[y_{1},\ldots,0,\ldots,y_{n}]^{\top}$ and ${\mathbf{K}}_{S_{i}}$ is the kernel matrix ${\mathbf{K}}$ with the $i^{\textrm{th}}$ row and column set to zero, which is the kernel matrix for the leave one out training set.

We define $\mathbf{a}=[-K(\mathbf{x}_{1},\mathbf{x}_{i}),\ldots,-K(\mathbf{x}_{n},\mathbf{x}_{i})]^{\top}\in\mathbb{R}^{n}$ and $\mathbf{b}\in\mathbb{R}^{n}$ as a one-hot column vector with all zeros apart from the $i^{\textrm{th}}$ component, which is $1$. Let $\mathbf{a}_{*}=\mathbf{a}+K(\mathbf{x}_{i},\mathbf{x}_{i})\mathbf{b}$. Then, we have:

$$\begin{split}{\mathbf{K}}_{*}&={\mathbf{K}}+\mathbf{a}\mathbf{b}^{\top}\\ {\mathbf{K}}_{S_{i}}&={\mathbf{K}}_{*}+\mathbf{b}\mathbf{a}_{*}^{\top}\end{split} \tag{9}$$
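The rank-2 update in equation (9) can be checked numerically; the sketch below (on a small RBF kernel matrix, an illustrative choice) confirms that the two rank-1 updates zero out exactly the $i$-th row and column of ${\mathbf{K}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
Z = rng.standard_normal((n, 3))
K = np.exp(-np.sum((Z[:, None] - Z[None, :]) ** 2, -1) / 2)   # RBF kernel matrix

i = 3                                    # index of the left-out point
a = -K[:, i].copy()                      # a = -[K(x_1, x_i), ..., K(x_n, x_i)]^T
b = np.zeros(n); b[i] = 1.0              # i-th canonical basis vector
a_star = a + K[i, i] * b

K_star = K + np.outer(a, b)              # zeroes the i-th column of K
K_Si = K_star + np.outer(b, a_star)      # zeroes the i-th row as well

K_check = K.copy()
K_check[i, :] = 0.0
K_check[:, i] = 0.0
print(np.allclose(K_Si, K_check))        # True: K_Si is K with row/column i zeroed
```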

That is, we can write ${\mathbf{K}}_{S_{i}}$ as a rank-2 update to ${\mathbf{K}}$. This can be verified by simple algebra, using the fact that $K$ is a symmetric kernel. Now we are interested in bounding $||\hat{f}_{S}-\hat{f}_{S_{i}}||_{\mathcal{H}}$. For a function $h(\cdot)=\sum_{i=1}^{m}p_{i}K(\mathbf{x}_{i},\cdot)\in\mathcal{H}$ we have $||h||_{\mathcal{H}}=\sqrt{\mathbf{p}^{\top}{\mathbf{K}}\mathbf{p}}=||{\mathbf{K}}^{\frac{1}{2}}\mathbf{p}||$. So we have:

$$\begin{split}||\hat{f}_{S}-\hat{f}_{S_{i}}||_{\mathcal{H}}&=||{\mathbf{K}}^{\frac{1}{2}}(\hat{\mathbf{c}}_{S}-\hat{\mathbf{c}}_{S_{i}})||\\ &=||{\mathbf{K}}^{\frac{1}{2}}({\mathbf{K}}^{\dagger}\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})\mathbf{v}-({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{y}_{i}-(\mathbf{I}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i})||\\ &=||{\mathbf{K}}^{\frac{1}{2}}({\mathbf{K}}^{\dagger}\mathbf{y}-({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{y}+y_{i}({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{b}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})(\mathbf{v}-\mathbf{v}_{i})+({\mathbf{K}}^{\dagger}{\mathbf{K}}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i})||\\ &=||{\mathbf{K}}^{\frac{1}{2}}[({\mathbf{K}}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger})\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})(\mathbf{v}-\mathbf{v}_{i})+({\mathbf{K}}^{\dagger}{\mathbf{K}}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i}]||\end{split} \tag{10}$$

Here we make use of the fact that $({\mathbf{K}}_{S_{i}})^{\dagger}\mathbf{b}=\mathbf{0}$. If ${\mathbf{K}}$ has full rank (as in Remark 2), we see that $\mathbf{b}$ lies in the column space of ${\mathbf{K}}$ and $\mathbf{a}$ lies in the column space of ${\mathbf{K}}^{\top}$. Furthermore, $\beta=1+\mathbf{b}^{\top}{\mathbf{K}}^{\dagger}\mathbf{a}=0$. Using Theorem 6 in Meyer (1973) (equivalent to formula 2.1 of Baksalary et al. (2003)) with $\mathbf{k}={\mathbf{K}}^{\dagger}\mathbf{a}$, $\mathbf{h}=\mathbf{b}^{\top}{\mathbf{K}}^{\dagger}$, we obtain:

$$\begin{split}({\mathbf{K}}_{*})^{\dagger}&={\mathbf{K}}^{\dagger}-\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{K}}^{\dagger}-{\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}+(\mathbf{k}^{\dagger}{\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger})\mathbf{k}\mathbf{h}\\ &={\mathbf{K}}^{\dagger}-\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{K}}^{\dagger}-{\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}-\mathbf{k}\mathbf{h}\\ &={\mathbf{K}}^{\dagger}-{\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}\end{split} \tag{11}$$

Above, we use the fact that the operator norm of a rank-1 matrix is given by $||\mathbf{u}\mathbf{v}^{\top}||_{op}=||\mathbf{u}||\times||\mathbf{v}||$ and that $\mathbf{k}=-\mathbf{b}$. Also, using the corresponding formula from List 2 of Baksalary et al. (2003), we have ${\mathbf{K}}_{*}^{\dagger}{\mathbf{K}}_{*}={\mathbf{K}}^{\dagger}{\mathbf{K}}-\mathbf{k}\mathbf{k}^{\dagger}$.

Next, we see that for ${\mathbf{K}}_{S_{i}}$, $\mathbf{a}_{*}$ lies in the column space of ${\mathbf{K}}_{*}^{\top}$, and $\mathbf{b}$ does not lie in the column space of ${\mathbf{K}}_{*}$. This means we can use Theorem 5 in Meyer (1973) (equivalent to formula 2.4 in Baksalary et al. (2003)) to obtain the expression for $({\mathbf{K}}_{S_{i}})^{\dagger}$:

$$({\mathbf{K}}_{S_{i}})^{\dagger}={\mathbf{K}}_{*}^{\dagger}-\nu^{-1}(\phi{\mathbf{K}}_{*}^{\dagger}\mathbf{e}\mathbf{e}^{\top}+\eta\mathbf{d}\mathbf{f}^{\top})+\nu^{-1}(\lambda{\mathbf{K}}_{*}^{\dagger}\mathbf{e}\mathbf{f}^{\top}-\lambda\mathbf{d}\mathbf{e}^{\top}) \tag{12}$$

Also, using the corresponding formula from List 2 of Baksalary et al. (2003), we have $({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}}={\mathbf{K}}_{*}^{\dagger}{\mathbf{K}}_{*}$, which implies that ${\mathbf{K}}^{\dagger}{\mathbf{K}}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}}=\mathbf{k}\mathbf{k}^{\dagger}$.

Now let us define all the terms in equation  (12). We omit some algebraic simplification to save some space.

$$\begin{split}\mathbf{d}&={\mathbf{K}}_{*}^{\dagger}\mathbf{b}={\mathbf{K}}^{\dagger}(\mathbf{I}-\mathbf{h}^{\dagger}\mathbf{h})\mathbf{b}\\ \mathbf{e}&=({\mathbf{K}}_{*}^{\dagger})^{\top}\mathbf{a}_{*}=(\mathbf{h}^{\dagger}\mathbf{h}-\mathbf{I})\mathbf{b}\\ \mathbf{f}&=(\mathbf{I}-{\mathbf{K}}_{*}{\mathbf{K}}_{*}^{\dagger})\mathbf{b}=\mathbf{h}^{\dagger}\mathbf{h}\mathbf{b}\\ \lambda&=1+\mathbf{a}_{*}^{\top}{\mathbf{K}}_{*}^{\dagger}\mathbf{b}=1+\mathbf{e}^{\top}\mathbf{b}=\mathbf{b}^{\top}\mathbf{h}^{\dagger}\mathbf{h}\mathbf{b}\\ \phi&=\mathbf{f}^{\top}\mathbf{f}=\mathbf{b}^{\top}\mathbf{h}^{\dagger}\mathbf{h}\mathbf{b}=\lambda\\ \eta&=\mathbf{e}^{\top}\mathbf{e}=1-\lambda\\ \nu&=\lambda^{2}+\eta\phi=\lambda\end{split} \tag{13}$$

This means we can simplify equation  (12) as

$$\begin{split}{\mathbf{K}}_{*}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger}&=\nu^{-1}(\phi{\mathbf{K}}_{*}^{\dagger}\mathbf{e}\mathbf{e}^{\top}+\eta\mathbf{d}\mathbf{f}^{\top})-\nu^{-1}(\lambda{\mathbf{K}}_{*}^{\dagger}\mathbf{e}\mathbf{f}^{\top}-\lambda\mathbf{d}\mathbf{e}^{\top})\\ &=\lambda^{-1}\left(\lambda{\mathbf{K}}_{*}^{\dagger}\mathbf{e}(\mathbf{e}-\mathbf{f})^{\top}+\eta\mathbf{d}\mathbf{f}^{\top}+\lambda\mathbf{d}\mathbf{e}^{\top}\right)\\ &={\mathbf{K}}_{*}^{\dagger}\mathbf{e}(\mathbf{e}-\mathbf{f})^{\top}+\mathbf{d}(\mathbf{e}-\mathbf{f})^{\top}+\lambda^{-1}\mathbf{d}\mathbf{f}^{\top}\end{split} \tag{14}$$

Putting together  (11),  (12) (and after some algebraic simplification) we get:

$$\begin{split}{\mathbf{K}}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger}&={\mathbf{K}}^{\dagger}-{\mathbf{K}}_{*}^{\dagger}+{\mathbf{K}}_{*}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger}\\ &={\mathbf{K}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}+\lambda^{-1}{\mathbf{K}}^{\dagger}(\mathbf{I}-\mathbf{h}^{\dagger}\mathbf{h})\mathbf{b}\mathbf{b}^{\top}\mathbf{h}^{\dagger}\mathbf{h}\\ &=({\mathbf{K}}^{\dagger}_{ii})^{-1}\mathbf{h}^{\top}\mathbf{h}\end{split} \tag{15}$$

Plugging these calculations into equation (10) we get:

$$\begin{split}||\hat{f}_{S}-\hat{f}_{S_{i}}||_{\mathcal{H}}&=||{\mathbf{K}}^{\frac{1}{2}}[({\mathbf{K}}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger})\mathbf{y}+(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})(\mathbf{v}-\mathbf{v}_{i})+({\mathbf{K}}^{\dagger}{\mathbf{K}}-({\mathbf{K}}_{S_{i}})^{\dagger}{\mathbf{K}}_{S_{i}})\mathbf{v}_{i}]||\\ &\leq||{\mathbf{K}}^{\frac{1}{2}}||_{op}\left(||({\mathbf{K}}^{\dagger}-({\mathbf{K}}_{S_{i}})^{\dagger})\mathbf{y}||+||(\mathbf{I}-{\mathbf{K}}^{\dagger}{\mathbf{K}})(\mathbf{v}-\mathbf{v}_{i})||+||\mathbf{k}\mathbf{k}^{\dagger}\mathbf{v}_{i}||\right)\\ &\leq||{\mathbf{K}}^{\frac{1}{2}}||_{op}(B_{0}+2||\mathbf{v}-\mathbf{v}_{i}||+||\mathbf{v}_{i}||)\end{split} \tag{16}$$

We see that the right hand side is minimized when $\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}$. We have also computed $B_{0}=||{\mathbf{K}}^{\dagger}||_{op}\times\textrm{cond}({\mathbf{K}})\times||\mathbf{y}||$, which concludes the proof of Lemma 11.

5 Remark and Related Work

In the previous section we obtained bounds on the $\mbox{CV}_{loo}$ stability of interpolating solutions to the kernel least squares problem. Our kernel least squares results can be compared with stability bounds for regularized ERM (see Remark 3). Regularized ERM has a strong stability guarantee in terms of a uniform stability bound, which turns out to be inversely proportional to the regularization parameter $\lambda$ and the sample size $n$ (Bousquet & Elisseeff, 2001). However, this estimate becomes vacuous as $\lambda\to 0$. In this paper, we establish a bound on average stability, and show that this bound is minimized when the minimum norm ERM solution is chosen. We study average stability since one can expect worst case scenarios where the minimum norm is arbitrarily large (when $n\approx d$). Our main finding is in fact the relationship between minimizing the norm of the ERM solution and minimizing a bound on stability.

This leads to a second observation, namely, that we can consider the limit of our risk bounds as both the sample size (nn) and the dimensionality of the data (dd) go to infinity. This is a classical setting in statistics which allows us to use results from random matrix theory (Marchenko & Pastur, 1967). In particular, for linear kernels the behavior of the smallest eigenvalue of the kernel matrix (which appears in our bounds) can be characterized in this asymptotic limit. Here the dimension of the data coincides with the number of parameters in the model. Interestingly, analogous results hold for more general kernels (El Karoui, 2010) where the asymptotics are taken with respect to the number and dimensionality of the data. These results predict a double descent curve for the condition number as found in practice, see Figure 2.

Recently, there has been a surge of interest in studying linear and kernel least squares models, since classical results focus on situations where constraints or penalties that prevent interpolation are added to the empirical risk. For example, high dimensional linear regression is considered in Mei & Montanari (2019); Hastie et al. (2019); Bartlett et al. (2019), and "ridgeless" kernel least squares is studied in Liang et al. (2019); Rakhlin & Zhai (2018) and Liang et al. (2020). While these papers study upper and lower bounds on the risk of interpolating solutions to the linear and kernel least squares problem, ours are the first derived using stability arguments. While it might be possible to obtain tighter excess risk bounds through a careful analysis of the minimum norm interpolant, our simple approach helps us establish a link between stability in the statistical and in the numerical sense.

Finally, we can compare our results with observations made in Poggio et al. (2019) on the condition number of random kernel matrices. The condition number of the empirical kernel matrix is known to control the numerical stability of the solution to a kernel least squares problem. Our results show that the statistical stability is also controlled by the condition number of the kernel matrix, providing a natural link between numerical and statistical stability.

Figure 2: Typical double descent of the condition number (y axis) of a radial basis function kernel $K(x,x^{\prime})=\exp\left(-\frac{||x-x^{\prime}||^{2}}{2\sigma^{2}}\right)$ built from a random data matrix distributed as $\mathcal{N}(0,1)$: as in the linear case, the condition number is worse when $n=d$, better if $n>d$ (on the right of $n=d$) and also better if $n<d$ (on the left of $n=d$). The parameter $\sigma$ was chosen to be 5. From Poggio et al. (2019).
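A sketch of the kind of experiment behind Figure 2 is given below: it averages the condition number of an RBF kernel matrix over random draws of the data for several values of $n$ at fixed $d$. The values of $d$, $\sigma$, and the grid of $n$ are illustrative guesses; how pronounced the peak near $n=d$ appears depends on these choices and on the data normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 100, 5.0

def mean_rbf_condition_number(n, trials=10):
    vals = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))                   # rows ~ N(0, I_d)
        sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        K = np.exp(-sq / (2 * sigma**2))                  # RBF kernel matrix
        vals.append(np.linalg.cond(K))
    return np.mean(vals)

for n in [20, 50, 90, 100, 110, 150, 300]:
    print(f"n = {n:4d}   cond(K) ~ {mean_rbf_condition_number(n):.3e}")
```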

6 Conclusions

In summary, minimizing a bound on cross validation stability minimizes the expected error in both the classical and the modern regime of ERM. In the classical regime ($d<n$), $\mbox{CV}_{loo}$ stability implies generalization and consistency for $n\to\infty$. In the modern regime ($d>n$), as described in this paper, $\mbox{CV}_{loo}$ stability can account for the double descent curve in kernel interpolants (Belkin et al. (2019) showed empirical evidence for closely related interpolation methods) under appropriate distributional assumptions. The main contribution of this paper is characterizing stability of interpolating solutions, in particular deriving excess risk bounds via a stability argument. In the process, we show that among all the interpolating solutions, the one with minimal norm also minimizes a bound on stability. Since the excess risk bounds of the minimum norm interpolant depend on the pseudoinverse of the kernel matrix, we establish here a clear link between numerical and statistical stability. This also holds for solutions computed by gradient descent, since gradient descent converges to minimum norm solutions in the case of "linear" kernel methods. Our approach is simple and combines basic stability results with matrix inequalities.

References

  • Baksalary et al. (2003) Jerzy K Baksalary, Oskar Maria Baksalary, and Götz Trenkler. A revisitation of formulae for the moore–penrose inverse of modified matrices. Linear Algebra and Its Applications, 372:207–224, 2003.
  • Bartlett et al. (2019) Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. CoRR, abs/1906.11300, 2019. URL http://arxiv.org/abs/1906.11300.
  • Belkin et al. (2019) Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1903070116. URL https://www.pnas.org/content/116/32/15849.
  • Boucheron et al. (2005) Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: probability and statistics, 9:323–375, 2005.
  • Bousquet & Elisseeff (2001) O. Bousquet and A. Elisseeff. Stability and generalization. Journal Machine Learning Research, 2001.
  • Bühlmann & Van De Geer (2011) Peter Bühlmann and Sara Van De Geer. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media, 2011.
  • El Karoui (2010) Noureddine El Karoui. The spectrum of kernel random matrices. arXiv e-prints, art. arXiv:1001.0492, Jan 2010.
  • Hastie et al. (2019) Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in High-Dimensional Ridgeless Least Squares Interpolation. arXiv e-prints, art. arXiv:1903.08560, Mar 2019.
  • Kutin & Niyogi (2002) S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. Technical report TR-2002-03, University of Chicago, 2002.
  • Liang et al. (2019) Tengyuan Liang, Alexander Rakhlin, and Xiyu Zhai. On the Risk of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels. arXiv e-prints, art. arXiv:1908.10292, Aug 2019.
  • Liang et al. (2020) Tengyuan Liang, Alexander Rakhlin, et al. Just interpolate: Kernel “ridgeless” regression can generalize. Annals of Statistics, 48(3):1329–1347, 2020.
  • Marchenko & Pastur (1967) V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mat. Sb. (N.S.), 72(114):4:457–483, 1967.
  • Mei & Montanari (2019) Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv e-prints, art. arXiv:1908.05355, Aug 2019.
  • Meyer (1973) Carl Meyer. Generalized inversion of modified matrices. SIAM J. Applied Math, 24:315–323, 1973.
  • Micchelli (1986) C. A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11–22, 1986.
  • Mukherjee et al. (2006) Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1):161–193, 2006. ISSN 1572-9044. doi: 10.1007/s10444-004-7634-z. URL http://dx.doi.org/10.1007/s10444-004-7634-z.
  • Poggio et al. (2004) T. Poggio, R. Rifkin, S. Mukherjee, and P. Niyogi. General conditions for predictivity in learning theory. Nature, 428:419–422, March 2004.
  • Poggio et al. (2019) T. Poggio, G. Kur, and A. Banburski. Double descent in the condition number. Technical report, MIT Center for Brains Minds and Machines, 2019.
  • Poggio (2020) Tomaso Poggio. Stable foundations for learning. Center for Brains, Minds and Machines (CBMM) Memo No. 103, 2020.
  • Rakhlin & Zhai (2018) Alexander Rakhlin and Xiyu Zhai. Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon. arXiv e-prints, art. arXiv:1812.11167, Dec 2018.
  • Rosasco & Villa (2015) Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 1630–1638. Curran Associates, Inc., 2015. URL http://papers.nips.cc/paper/6015-learning-with-incremental-iterative-regularization.pdf.
  • Shalev-Shwartz & Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA, 2014. ISBN 1107057132.
  • Shalev-Shwartz et al. (2010) Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. J. Mach. Learn. Res., 11:2635–2670, December 2010. ISSN 1532-4435.
  • Steinwart & Christmann (2008) Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
  • Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Sy8gdB9xx.

Appendix A Excess Risk, Generalization, and Stability

We use the same notation as introduced in Section 2 for the various quantities considered in this section. That is, in the supervised learning setup $V(f,z)$ is the loss incurred by hypothesis $f$ on the sample $z$, and $I[f]=\mathbb{E}_{z}[V(f,z)]$ is the expected error of hypothesis $f$. Since we are interested in different forms of stability, we will consider learning problems over the original training set $S=\{z_{1},z_{2},\ldots,z_{n}\}$, the leave one out training set $S_{i}=\{z_{1},\ldots,z_{i-1},z_{i+1},\ldots,z_{n}\}$, and the replace one training set $(S_{i},z)=\{z_{1},\ldots,z_{i-1},z_{i+1},\ldots,z_{n},z\}$.

A.1 Replace one and leave one out algorithmic stability

Similar to the definition of expected $\mbox{CV}_{loo}$ stability in equation (3) of the main paper, we say an algorithm is cross validation replace one stable (in expectation), denoted as $\mbox{CV}_{ro}$, if there exists $\beta_{ro}>0$ such that

$$\mathbb{E}_{S,z}[V(f_{S},z)-V(f_{(S_{i},z)},z)]\leq\beta_{ro}.$$

We can strengthen the above stability definition by introducing the notion of replace one algorithmic stability (in expectation) (Bousquet & Elisseeff, 2001): there exists $\alpha_{ro}>0$ such that for all $i=1,\dots,n$,

$$\mathbb{E}_{S,z}[\left\lVert f_{S}-f_{(S_{i},z)}\right\rVert_{\infty}]\leq\alpha_{ro}.$$

We make two observations:
First, if the loss is Lipschitz, that is, if there exists $C_{V}>0$ such that for all $f,f^{\prime}\in\mathcal{H}$

$$\left\lVert V(f,z)-V(f^{\prime},z)\right\rVert\leq C_{V}\left\lVert f-f^{\prime}\right\rVert,$$

then replace one algorithmic stability implies $\mbox{CV}_{ro}$ stability with $\beta_{ro}=C_{V}\alpha_{ro}$. Moreover, the same result holds if the loss is locally Lipschitz and there exists $R>0$ such that $\left\lVert f_{S}\right\rVert_{\infty}\leq R$ almost surely. In this latter case the Lipschitz constant will depend on $R$. Later, we illustrate this situation for the square loss.

Second, we have for all $i=1,\dots,n$, $S$ and $z$,

$$\mathbb{E}_{S,z}[\left\lVert f_{S}-f_{(S_{i},z)}\right\rVert_{\infty}]\leq\mathbb{E}_{S,z}[\left\lVert f_{S}-f_{S_{i}}\right\rVert_{\infty}]+\mathbb{E}_{S,z}[\left\lVert f_{(S_{i},z)}-f_{S_{i}}\right\rVert_{\infty}].$$

This observation motivates the notion of leave one out algorithmic stability (in expectation) (Bousquet & Elisseeff, 2001):

$$\mathbb{E}_{S,z}[\left\lVert f_{S}-f_{S_{i}}\right\rVert_{\infty}]\leq\alpha_{loo}.$$

Clearly, leave one out algorithmic stability implies replace one algorithmic stability with $\alpha_{ro}=2\alpha_{loo}$, and it also implies $\mbox{CV}_{ro}$ stability with $\beta_{ro}=2C_{V}\alpha_{loo}$.

A.2 Excess Risk and $\mbox{CV}_{loo}$, $\mbox{CV}_{ro}$ Stability

We recall the statement of Lemma 5 in Section 3, which bounds the excess risk using the $\mbox{CV}_{loo}$ stability of a solution.

Lemma 12 (Excess Risk & $\mbox{CV}_{loo}$ Stability)

For all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I[f_{S_{i}}]-\inf_{f\in\mathcal{H}}I[f]]\leq\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]. \tag{17}$$

In this section, two properties of ERM are useful, namely symmetry, and a form of unbiasedness.

Symmetry.

A key property of ERM is that it is symmetric with respect to the data set $S$, meaning that it does not depend on the order of the data in $S$.

A second property relates the expected ERM with the minimum of expected risk.

ERM Bias.

The following inequality holds.

$$\mathbb{E}_{S}[I_{S}[f_{S}]]-\min_{f\in\mathcal{H}}I[f]\leq 0. \tag{18}$$

To see this, note that

$$I_{S}[f_{S}]\leq I_{S}[f]$$

for all $f\in\mathcal{H}$ by definition of ERM, so that taking the expectation of both sides

$$\mathbb{E}_{S}[I_{S}[f_{S}]]\leq\mathbb{E}_{S}[I_{S}[f]]=I[f]$$

for all $f\in\mathcal{H}$. This implies

$$\mathbb{E}_{S}[I_{S}[f_{S}]]\leq\min_{f\in\mathcal{H}}I[f]$$

and hence (18) holds.

Remark 13

Note that the same argument gives more generally that

$$\mathbb{E}_{S}[\inf_{f\in\mathcal{H}}I_{S}[f]]-\inf_{f\in\mathcal{H}}I[f]\leq 0. \tag{19}$$

Given the above premise, the proof of Lemma 5 is simple.

Proof [of Lemma 5] Adding and subtracting $\mathbb{E}_{S}[I_{S}[f_{S}]]$ from the expected excess risk we have that

$$\mathbb{E}_{S}[I[f_{S_{i}}]-\min_{f\in\mathcal{H}}I[f]]=\mathbb{E}_{S}[I[f_{S_{i}}]-I_{S}[f_{S}]+I_{S}[f_{S}]-\min_{f\in\mathcal{H}}I[f]], \tag{20}$$

and since $\mathbb{E}_{S}[I_{S}[f_{S}]]-\min_{f\in\mathcal{H}}I[f]$ is less than or equal to zero, see (19), then

$$\mathbb{E}_{S}[I[f_{S_{i}}]-\min_{f\in\mathcal{H}}I[f]]\leq\mathbb{E}_{S}[I[f_{S_{i}}]-I_{S}[f_{S}]]. \tag{21}$$

Moreover, for all $i=1,\dots,n$

$$\mathbb{E}_{S}[I[f_{S_{i}}]]=\mathbb{E}_{S}[\mathbb{E}_{z_{i}}V(f_{S_{i}},z_{i})]=\mathbb{E}_{S}[V(f_{S_{i}},z_{i})]$$

and

$$\mathbb{E}_{S}[I_{S}[f_{S}]]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{S}[V(f_{S},z_{i})]=\mathbb{E}_{S}[V(f_{S},z_{i})].$$

Plugging these last two expressions in (21) and in (20) leads to (4).  


We can prove a similar result relating excess risk with $\mbox{CV}_{ro}$ stability.

Lemma 14 (Excess Risk & $\mbox{CV}_{ro}$ Stability)

Given the above definitions, the following inequality holds for all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I[f_{S}]-\inf_{f\in\mathcal{H}}I[f]]\leq\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]]=\mathbb{E}_{S,z}[V(f_{S},z)-V(f_{(S_{i},z)},z)]. \tag{22}$$

Proof  The first inequality is clear: adding and subtracting $I_{S}[f_{S}]$ from the expected risk $I[f_{S}]$ we have that

$$\mathbb{E}_{S}[I[f_{S}]-\min_{f\in\mathcal{H}}I[f]]=\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]+I_{S}[f_{S}]-\min_{f\in\mathcal{H}}I[f]],$$

and recalling (19). The main step in the proof is showing that for all $i=1,\dots,n$,

$$\mathbb{E}[I_{S}[f_{S}]]=\mathbb{E}[V(f_{(S_{i},z)},z)] \tag{23}$$

to be compared with the trivial equality $\mathbb{E}[I_{S}[f_{S}]]=\mathbb{E}[V(f_{S},z_{i})]$. To prove Equation (23), we have for all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I_{S}[f_{S}]]=\mathbb{E}_{S,z}\left[\frac{1}{n}\sum_{i=1}^{n}V(f_{S},z_{i})\right]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{S,z}[V(f_{(S_{i},z)},z)]=\mathbb{E}_{S,z}[V(f_{(S_{i},z)},z)]$$

where we used the fact that by the symmetry of the algorithm $\mathbb{E}_{S,z}[V(f_{(S_{i},z)},z)]$ is the same for all $i=1,\dots,n$. The proof is concluded noting that $\mathbb{E}_{S}[I[f_{S}]]=\mathbb{E}_{S,z}[V(f_{S},z)]$.  


A.3 Discussion on Stability and Generalization

Below we discuss some more aspects of stability and its connection to other quantities in statistical learning theory.

Remark 15 ($\mbox{CV}_{loo}$ stability in expectation and in probability)

In Mukherjee et al. (2006), $\mbox{CV}_{loo}$ stability is defined in probability, that is, there exists $\beta^{P}_{CV}>0$, $0<\delta^{P}_{CV}\leq 1$ such that

$$\mathbb{P}_{S}\{|V(f_{S_{i}},z_{i})-V(f_{S},z_{i})|\geq\beta^{P}_{CV}\}\leq\delta^{P}_{CV}.$$

Note that the absolute value is not needed for ERM since almost positivity holds (Mukherjee et al., 2006), that is, $V(f_{S_{i}},z_{i})-V(f_{S},z_{i})>0$. Then $\mbox{CV}_{loo}$ stability in probability and in expectation are clearly related, and indeed equivalent for bounded loss functions. $\mbox{CV}_{loo}$ stability in expectation (3) is what we study in the following sections.

Remark 16 (Connection to uniform stability and other notions of stability)

Uniform stability, introduced by Bousquet & Elisseeff (2001), corresponds in our notation to the assumption that there exists $\beta_{u}>0$ such that for all $i=1,\dots,n$, $\sup_{z\in Z}|V(f_{S_{i}},z)-V(f_{S},z)|\leq\beta_{u}$. Clearly this is a strong notion implying most other definitions of stability. We note that there are a number of different notions of stability. We refer the interested reader to Kutin & Niyogi (2002) and Mukherjee et al. (2006).

Remark 17 ($\mbox{CV}_{loo}$ Stability & Learnability)

A natural question is to which extent suitable notions of stability are not only sufficient but also necessary for controlling the excess risk of ERM. Classically, the latter is characterized in terms of a uniform version of the law of large numbers, which itself can be characterized in terms of suitable complexity measures of the hypothesis class. Uniform stability is too strong to characterize consistency, while $\mbox{CV}_{loo}$ stability turns out to provide a suitably weak definition, as shown in Mukherjee et al. (2006), see also Kutin & Niyogi (2002). Indeed, a main result in Mukherjee et al. (2006) shows that $\mbox{CV}_{loo}$ stability is equivalent to consistency of ERM:

Theorem 18

Mukherjee et al. (2006) For ERM and bounded loss functions, $\mbox{CV}_{loo}$ stability in probability with $\beta^{P}_{CV}$ converging to zero for $n\to\infty$ is equivalent to consistency and generalization of ERM.

Remark 19 ($\mbox{CV}_{loo}$ stability & in-sample/out-of-sample error)

Let $(S,z)=\{z_{1},\dots,z_{n},z\}$ ($z$ is a data point drawn according to the same distribution) and let $f_{(S,z)}$ be the corresponding ERM solution; then (4) can be equivalently written as

$$\mathbb{E}_{S}[I[f_{S}]-\inf_{f\in\mathcal{F}}I[f]]\leq\mathbb{E}_{S,z}[V(f_{S},z)-V(f_{(S,z)},z)].$$

Thus $\mbox{CV}_{loo}$ stability measures how much the loss at a point changes when that point is included in the training set versus when it is left out. In this view, it can be seen as an average measure of the difference between the in-sample and out-of-sample error.

Remark 20 ($\mbox{CV}_{loo}$ stability and generalization)

A common error measure is the (expected) generalization gap $\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]]$. For non-ERM algorithms, $\mbox{CV}_{loo}$ stability by itself is not sufficient to control this term, and further conditions are needed (Mukherjee et al., 2006), since

$$\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S}]]=\mathbb{E}_{S}[I[f_{S}]-I_{S}[f_{S_{i}}]]+\mathbb{E}_{S}[I_{S}[f_{S_{i}}]-I_{S}[f_{S}]].$$

The second term becomes, for all $i=1,\dots,n$,

$$\mathbb{E}_{S}[I_{S}[f_{S_{i}}]-I_{S}[f_{S}]]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]=\mathbb{E}_{S}[V(f_{S_{i}},z_{i})-V(f_{S},z_{i})]$$

and hence is controlled by $\mbox{CV}_{loo}$ stability. The first term is called the expected leave one out error in Mukherjee et al. (2006) and is controlled in ERM as $n\to\infty$, see Theorem 18 above.

Appendix B $\mbox{CV}_{loo}$ Stability of Linear Regression

We have a dataset $S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}$ and we want to find a mapping $\mathbf{w}\in\mathbb{R}^{d}$ that minimizes the empirical least squares risk. All interpolating solutions are of the form $\hat{\mathbf{w}}_{S}^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}^{\top}(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})$. Similarly, all interpolating solutions on the leave one out dataset $S_{i}$ can be written as $\hat{\mathbf{w}}_{S_{i}}^{\top}=\mathbf{y}_{i}^{\top}({\mathbf{X}}_{i})^{\dagger}+\mathbf{v}_{i}^{\top}(\mathbf{I}-{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger})$. Here ${\mathbf{X}},{\mathbf{X}}_{i}\in\mathbb{R}^{d\times n}$ are the data matrices for the original and leave one out datasets respectively. We note that when $\mathbf{v}=\mathbf{0}$ and $\mathbf{v}_{i}=\mathbf{0}$, we obtain the minimum norm interpolating solutions on the datasets $S$ and $S_{i}$.

In this section we want to estimate the $\mbox{CV}_{loo}$ stability of the minimum norm solution to the ERM problem in the linear regression case. This is the case outlined in Remark 9 of the main paper. In order to prove Remark 9, we only need to combine Lemma 10 with the linear regression analogue of Lemma 11. We state and prove that result in this section. This result predicts a double descent curve for the norm of the pseudoinverse, as found in practice, see Figure 3.

Lemma 21

Let 𝐰^S,𝐰^Si\hat{\mathbf{w}}_{S},\hat{\mathbf{w}}_{S_{i}} be any interpolating least squares solutions on the full and leave one out datasets S,SiS,S_{i}, then 𝐰^S𝐰^SiBCV(𝐗,𝐲,𝐯,𝐯i)\left\lVert{\hat{\mathbf{w}}_{S}-\hat{\mathbf{w}}_{S_{i}}}\right\rVert\leq B_{CV}({\mathbf{X}}^{\dagger},\mathbf{y},\mathbf{v},\mathbf{v}_{i}), and BCVB_{CV} is minimized when 𝐯=𝐯i=𝟎\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}, which corresponds to the minimum norm interpolating solutions 𝐰S,𝐰Si\mathbf{w}_{S}^{\dagger},\mathbf{w}_{S_{i}}^{\dagger}.

Also,

𝐰S𝐰Si𝐗op𝐲\left\lVert{\mathbf{w}_{S}^{\dagger}-\mathbf{w}_{S_{i}}^{\dagger}}\right\rVert\leq\left\lVert{{\mathbf{X}}^{\dagger}}\right\rVert_{op}\left\lVert{\mathbf{y}}\right\rVert (24)

As mentioned before in section 2.1 of the main paper, linear regression can be viewed as a case of the kernel regression problem where =d{\cal H}=\operatorname{\mathbb{R}}^{d}, and the feature map Φ\Phi is the identity map. The inner product and norms considered in this case are also the usual Euclidean inner product and 22-norm for vectors in d\operatorname{\mathbb{R}}^{d}. The notation \left\lVert{\cdot}\right\rVert denotes the Euclidean norm for vectors both in d\operatorname{\mathbb{R}}^{d} and n\operatorname{\mathbb{R}}^{n}. The usage of the norm should be clear from the context. Also, 𝐀op\left\lVert{\mathbf{A}}\right\rVert_{op} is the left operator norm for a matrix 𝐀n×d\mathbf{A}\in\operatorname{\mathbb{R}}^{n\times d}, that is 𝐀op=sup𝐲n,𝐲=1𝐲𝐀\left\lVert{\mathbf{A}}\right\rVert_{op}=\sup_{\mathbf{y}\in\operatorname{\mathbb{R}}^{n},||\mathbf{y}||=1}||\mathbf{y}^{\top}\mathbf{A}||.

We have nn samples in the training set for a linear regression problem, {(𝐱i,yi)}i=1n\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}. We collect all the samples into a single matrix/vector 𝐗=[𝐱1𝐱2𝐱n]d×n{\mathbf{X}}=[\mathbf{x}_{1}\mathbf{x}_{2}\ldots\mathbf{x}_{n}]\in\operatorname{\mathbb{R}}^{d\times n}, and 𝐲=[y1y2yn]n\mathbf{y}=[y_{1}y_{2}\ldots y_{n}]^{\top}\in\operatorname{\mathbb{R}}^{n}. Then any interpolating ERM solution 𝐰S\mathbf{w}_{S} satisfies the linear equation

𝐰S𝐗=𝐲\mathbf{w}_{S}^{\top}{\mathbf{X}}=\mathbf{y}^{\top} (25)

Any interpolating solution can be written as:

(𝐰^S)=𝐲𝐗+𝐯(𝐈𝐗𝐗).(\hat{\mathbf{w}}_{S})^{\top}=\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}^{\top}(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger}). (26)
Figure 3: Typical double descent of the pseudoinverse norm (y axis) of a random data matrix with entries distributed as 𝒩(0,1)\mathcal{N}(0,1): the condition number is worst when n=dn=d, and better both when n>dn>d (to the right of n=dn=d) and when n<dn<d (to the left of n=dn=d). From Poggio et al. (2019).
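The qualitative behavior in Figure 3 can be reproduced in a few lines; the sketch below (ours, with arbitrary sizes) computes 𝐗op||{\mathbf{X}}^{\dagger}||_{op} for Gaussian data matrices with fixed dd and varying nn, and the norm peaks near n=dn=d.

```python
# Minimal sketch: double descent of the pseudoinverse norm of an i.i.d. N(0,1)
# data matrix as a function of n for fixed d (sizes are arbitrary).
import numpy as np

rng = np.random.default_rng(2)
d = 100
for n in [25, 50, 75, 100, 125, 150, 200]:
    X = rng.standard_normal((d, n))
    s = np.linalg.svd(X, compute_uv=False)
    # ||X^+||_op is the reciprocal of the smallest nonzero singular value of X
    print(f"n = {n:3d}   ||X^+||_op ~ {1.0 / s[s > 1e-12].min():.2f}")
```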

If we consider the leave one out training set SiS_{i}, any interpolating ERM solution for 𝐗i=[𝐱1𝟎𝐱n]{\mathbf{X}}_{i}=[\mathbf{x}_{1}\ldots\mathbf{0}\ldots\mathbf{x}_{n}] and 𝐲i=[y10yn]\mathbf{y}_{i}=[y_{1}\ldots 0\ldots y_{n}]^{\top} can be written as

(𝐰^Si)=𝐲i(𝐗i)+𝐯i(𝐈𝐗i(𝐗i)).(\hat{\mathbf{w}}_{S_{i}})^{\top}=\mathbf{y}_{i}^{\top}({\mathbf{X}}_{i})^{\dagger}+\mathbf{v}_{i}^{\top}(\mathbf{I}-{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}). (27)

We can write 𝐗i{\mathbf{X}}_{i} as:

𝐗i=𝐗+𝐚𝐛{\mathbf{X}}_{i}={\mathbf{X}}+\mathbf{a}\mathbf{b}^{\top} (28)

where 𝐚d\mathbf{a}\in\operatorname{\mathbb{R}}^{d} is a column vector representing the additive change to the ithi^{\textrm{th}} column, i.e., 𝐚=𝐱i\mathbf{a}=-\mathbf{x}_{i}, and 𝐛n×1\mathbf{b}\in\operatorname{\mathbb{R}}^{n\times 1} is the ii-th element of the canonical basis in n\operatorname{\mathbb{R}}^{n} (all coefficients are zero except the ii-th, which is one). Thus 𝐚𝐛\mathbf{a}\mathbf{b}^{\top} is a d×nd\times n matrix composed of all zeros apart from the ithi^{\textrm{th}} column, which is equal to 𝐚\mathbf{a}.

We also have 𝐲i=𝐲yi𝐛\mathbf{y}_{i}=\mathbf{y}-y_{i}\mathbf{b}. Now, following Lemma 10, we are interested in bounding the quantity 𝐰^Si𝐰^S=(𝐰^Si)(𝐰^S)||\hat{\mathbf{w}}_{S_{i}}-\hat{\mathbf{w}}_{S}||=||(\hat{\mathbf{w}}_{S_{i}})^{\top}-(\hat{\mathbf{w}}_{S})^{\top}||. This can be rewritten as:

𝐰^Si𝐰^S=𝐲i(𝐗i)𝐲𝐗+𝐯i𝐯+𝐯𝐗𝐗𝐯i𝐗i(𝐗i)=(𝐲yi𝐛)(𝐗i)𝐲𝐗+𝐯i𝐯+𝐯𝐗𝐗𝐯i𝐗i(𝐗i)=𝐲((𝐗i)𝐗)+yi𝐛(𝐗i)+𝐯i𝐯+𝐯𝐗𝐗𝐯i𝐗i(𝐗i)=𝐲((𝐗i)𝐗)+𝐯i𝐯+𝐯𝐗𝐗𝐯i𝐗i(𝐗i)=𝐲((𝐗i)𝐗)+(𝐯i𝐯)(𝐈𝐗𝐗)𝐯i(𝐗𝐗𝐗i(𝐗i))\begin{split}||\hat{\mathbf{w}}_{S_{i}}-\hat{\mathbf{w}}_{S}||&=||\mathbf{y}_{i}^{\top}({\mathbf{X}}_{i})^{\dagger}-\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top}+\mathbf{v}^{\top}{\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{v}_{i}^{\top}{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}||\\ &=||(\mathbf{y}^{\top}-y_{i}\mathbf{b}^{\top})({\mathbf{X}}_{i})^{\dagger}-\mathbf{y}^{\top}{\mathbf{X}}^{\dagger}+\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top}+\mathbf{v}^{\top}{\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{v}_{i}^{\top}{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}||\\ &=||\mathbf{y}^{\top}(({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger})+y_{i}\mathbf{b}^{\top}({\mathbf{X}}_{i})^{\dagger}+\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top}+\mathbf{v}^{\top}{\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{v}_{i}^{\top}{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}||\\ &=||\mathbf{y}^{\top}(({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger})+\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top}+\mathbf{v}^{\top}{\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{v}_{i}^{\top}{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}||\\ &=||\mathbf{y}^{\top}(({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger})+(\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top})(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})-\mathbf{v}_{i}^{\top}({\mathbf{X}}{\mathbf{X}}^{\dagger}-{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger})||\end{split} (29)

In the above equation we make use of the fact that 𝐛(𝐗i)=𝟎\mathbf{b}^{\top}({\mathbf{X}}_{i})^{\dagger}=\mathbf{0}, which holds because the ii-th column of 𝐗i{\mathbf{X}}_{i} is zero, and hence so is the ii-th row of its pseudoinverse. To compute (𝐗i)({\mathbf{X}}_{i})^{\dagger} from 𝐗{\mathbf{X}}^{\dagger} we use a classical rank-one update formula for pseudoinverses of perturbed matrices (Meyer, 1973; Baksalary et al., 2003). We see that 𝐚=𝐱i\mathbf{a}=-\mathbf{x}_{i} is a vector in the column space of 𝐗{\mathbf{X}} and 𝐛\mathbf{b} is in the range space of 𝐗T{\mathbf{X}}^{T} (provided 𝐗{\mathbf{X}} has full column rank), with β=1+𝐛𝐗𝐚=1𝐛𝐗𝐱i=0\beta=1+\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}\mathbf{a}=1-\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}\mathbf{x}_{i}=0. This means we can use Theorem 6 in Meyer (1973) (equivalent to formula 2.1 in Baksalary et al. (2003)) to obtain the following expression for (𝐗i)({\mathbf{X}}_{i})^{\dagger}:

(𝐗i)=𝐗𝐤𝐤𝐗𝐗𝐡𝐡+(𝐤𝐗𝐡)𝐤𝐡({\mathbf{X}}_{i})^{\dagger}={\mathbf{X}}^{\dagger}-\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}-{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}+(\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger})\mathbf{k}\mathbf{h} (30)

where 𝐤=𝐗𝐚\mathbf{k}={\mathbf{X}}^{\dagger}\mathbf{a}, and 𝐡=𝐛𝐗\mathbf{h}=\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}, and 𝐮=𝐮𝐮2\mathbf{u}^{\dagger}=\frac{\mathbf{u}^{\top}}{||\mathbf{u}||^{2}} for any non-zero vector 𝐮\mathbf{u}.
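As a numerical sanity check (ours, not part of the original argument), the sketch below verifies equation 30 against a direct pseudoinverse computation, for an arbitrary Gaussian data matrix with d>nd>n so that the conditions above (full column rank and β=0\beta=0) hold.

```python
# Minimal check of the rank-one pseudoinverse update of equation 30
# (Meyer, 1973, Theorem 6) in the full-column-rank regime d > n.
import numpy as np

rng = np.random.default_rng(3)
d, n, i = 30, 20, 7
X = rng.standard_normal((d, n))                  # full column rank (generically)
Xp = np.linalg.pinv(X)                           # X^+, shape n x d

a = -X[:, i]                                     # a = -x_i
b = np.zeros(n); b[i] = 1.0                      # i-th canonical basis vector
Xi = X + np.outer(a, b)                          # X with its i-th column zeroed

k = Xp @ a                                       # k = X^+ a  (equals -b here)
h = b @ Xp                                       # h = b^T X^+
k_dag = k / np.dot(k, k)                         # u^+ = u^T / ||u||^2
h_dag = h / np.dot(h, h)

Xi_pinv = (Xp
           - np.outer(k, k_dag @ Xp)             # k k^+ X^+
           - np.outer(Xp @ h_dag, h)             # X^+ h^+ h
           + (k_dag @ Xp @ h_dag) * np.outer(k, h))  # (k^+ X^+ h^+) k h
print(np.allclose(Xi_pinv, np.linalg.pinv(Xi)))  # should print True
```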

(𝐗i)𝐗=(𝐤𝐗𝐡)𝐤𝐡𝐤𝐤𝐗𝐗𝐡𝐡=(𝐛𝐗𝐡)𝐛𝐡𝐛𝐡𝐗𝐡𝐡(𝐗i)𝐗op=𝐗𝐡𝐡op𝐗op\begin{split}({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger}&=(\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger})\mathbf{k}\mathbf{h}-\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}-{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}\\ &=(\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger})\mathbf{b}\mathbf{h}-\mathbf{b}\mathbf{h}-{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}\\ \implies||({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger}||_{op}&=||{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}\mathbf{h}||_{op}\\ &\leq||{\mathbf{X}}^{\dagger}||_{op}\end{split} (31)

In the above computation we use the fact that 𝐤=𝐛\mathbf{k}=-\mathbf{b}, so that 𝐤𝐤𝐗=𝐛𝐡\mathbf{k}\mathbf{k}^{\dagger}{\mathbf{X}}^{\dagger}=\mathbf{b}\mathbf{h}, and that 𝐛𝐗𝐡=𝐡𝐡=1\mathbf{b}^{\top}{\mathbf{X}}^{\dagger}\mathbf{h}^{\dagger}=\mathbf{h}\mathbf{h}^{\dagger}=1, so the first two terms on the second line cancel. The final inequality follows from the fact that the operator norm of a rank 1 matrix is given by 𝐮𝐯op=𝐮×𝐯||\mathbf{u}\mathbf{v}^{\top}||_{op}=||\mathbf{u}||\times||\mathbf{v}||, which gives 𝐡𝐡op=1||\mathbf{h}^{\dagger}\mathbf{h}||_{op}=1.

Also, from List 2 of Baksalary et al. (2003) we have that 𝐗i(𝐗i)=𝐗𝐗𝐡𝐡{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger}={\mathbf{X}}{\mathbf{X}}^{\dagger}-\mathbf{h}^{\dagger}\mathbf{h}.
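This identity can also be checked numerically; the short sketch below (ours, with arbitrary sizes) compares both sides directly.

```python
# Quick check of the identity X_i (X_i)^+ = X X^+ - h^+ h (Baksalary et al., 2003)
# in the full-column-rank regime d > n.
import numpy as np

rng = np.random.default_rng(5)
d, n, i = 30, 20, 3
X = rng.standard_normal((d, n))
Xi = X.copy(); Xi[:, i] = 0.0

Xp = np.linalg.pinv(X)
h = Xp[i]                                   # h = b^T X^+ is the i-th row of X^+
h_dag_h = np.outer(h, h) / np.dot(h, h)     # the rank-one projector h^+ h
print(np.allclose(Xi @ np.linalg.pinv(Xi), X @ Xp - h_dag_h))
```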

Plugging these calculations into equation 29, we get:

𝐰^Si𝐰^S=𝐲((𝐗i)𝐗)+(𝐯i𝐯)(𝐈𝐗𝐗)𝐯i(𝐗𝐗𝐗i(𝐗i))B0+𝐈𝐗𝐗op𝐯𝐯i+𝐯i×𝐡𝐡opB0+2𝐯𝐯i+𝐯i\begin{split}||\hat{\mathbf{w}}_{S_{i}}-\hat{\mathbf{w}}_{S}||&=||\mathbf{y}^{\top}(({\mathbf{X}}_{i})^{\dagger}-{\mathbf{X}}^{\dagger})+(\mathbf{v}_{i}^{\top}-\mathbf{v}^{\top})(\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger})-\mathbf{v}_{i}^{\top}({\mathbf{X}}{\mathbf{X}}^{\dagger}-{\mathbf{X}}_{i}({\mathbf{X}}_{i})^{\dagger})||\\ &\leq B_{0}+||\mathbf{I}-{\mathbf{X}}{\mathbf{X}}^{\dagger}||_{op}||\mathbf{v}-\mathbf{v}_{i}||+||\mathbf{v}_{i}||\times||\mathbf{h}^{\dagger}\mathbf{h}||_{op}\\ &\leq B_{0}+2||\mathbf{v}-\mathbf{v}_{i}||+||\mathbf{v}_{i}||\end{split} (32)

We see that the right hand side is minimized when 𝐯=𝐯i=𝟎\mathbf{v}=\mathbf{v}_{i}=\mathbf{0}. Moreover, by equation 31 and the definition of the operator norm, the first term is bounded by B0=𝐗op𝐲B_{0}=||{\mathbf{X}}^{\dagger}||_{op}||\mathbf{y}||. This concludes the proof of Lemma 21.
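To illustrate Lemma 21 numerically (an illustration of ours, not part of the proof), the sketch below compares the leave one out perturbation of the minimum norm interpolant with that of generic interpolants built from random 𝐯,𝐯i\mathbf{v},\mathbf{v}_{i}, and checks the bound of equation 24; the data model and sizes are arbitrary.

```python
# Minimal sketch: leave-one-out perturbation of interpolating solutions, and
# the bound ||w_S - w_Si|| <= ||X^+||_op ||y|| of equation 24.
import numpy as np

rng = np.random.default_rng(4)
d, n, i = 60, 40, 5
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)
Xi = X.copy(); Xi[:, i] = 0.0               # leave-one-out data matrix
yi = y.copy(); yi[i] = 0.0

Xp, Xip = np.linalg.pinv(X), np.linalg.pinv(Xi)

def interpolant(Xp_, X_, y_, v):
    # w^T = y^T X^+ + v^T (I - X X^+)
    return y_ @ Xp_ + v @ (np.eye(d) - X_ @ Xp_)

w_S   = interpolant(Xp,  X,  y,  np.zeros(d))      # minimum norm on S
w_Si  = interpolant(Xip, Xi, yi, np.zeros(d))      # minimum norm on S_i
v, vi = rng.standard_normal(d), rng.standard_normal(d)
w_Sv  = interpolant(Xp,  X,  y,  v)                # generic interpolants
w_Siv = interpolant(Xip, Xi, yi, vi)

bound = np.linalg.norm(Xp, 2) * np.linalg.norm(y)  # ||X^+||_op ||y||
print("min-norm perturbation :", np.linalg.norm(w_S - w_Si))
print("bound ||X^+||_op ||y|| :", bound)
print("generic perturbation  :", np.linalg.norm(w_Sv - w_Siv))  # typically larger
```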