A finite sample analysis of generalisation for minimum norm interpolators in the ridge function model with random design
Abstract
Recent extensive numerical experiments in large-scale machine learning have uncovered a quite counterintuitive phase transition, as a function of the ratio between the sample size and the number of parameters in the model. As the number of parameters approaches the sample size, the generalisation error increases, but surprisingly, it starts decreasing again past this interpolation threshold. This phenomenon, brought to the attention of the theoretical community in [2], has been thoroughly investigated lately, mostly for models simpler than deep neural networks, such as the linear model when the parameter is taken to be the minimum norm solution to the least-squares problem, first in the asymptotic regime where the sample size and the number of parameters tend to infinity, see e.g. [8], and more recently in the finite dimensional regime, specifically for linear models [1], [16], [9]. In the present paper, we propose a finite sample analysis of non-linear models of ridge type, in which we investigate the overparametrised regime of the double descent phenomenon for both the estimation problem and the prediction problem. Our results provide a precise analysis of the distance of the best estimator from the true parameter, as well as a generalisation bound which complements the recent works [1] and [6]. Our analysis is based on tools closely related to the continuous Newton method [14] and on a refined quantitative analysis of the performance in prediction of the minimum $\ell_2$-norm solution.
1 Introduction
The tremendous achievements in deep learning theory and practice have received great attention in the applied Computer Science, Artificial Intelligence and Statistics communities in recent years. Many success stories related to the use of Deep Neural Networks have even been reported in the media, and most data scientists are proficient with the Deep Learning tools available via open-source machine learning libraries such as Tensorflow, Keras, Pytorch and many others.
One of the key ingredients in their success is the huge number of parameters involved in all current architectures. While enabling unprecedented expressive capabilities, such regimes of overparametrisation appear very counterintuitive through the lens of traditional statistical wisdom. Indeed, as intuition suggests, overparametrisation often results in interpolation, i.e. zero training error, and the expected outcome of such endeavours should be very poor generalisation performance. Nevertheless, interpolating deep neural networks still generalise well despite achieving zero or very small training error.
Belkin et al. [2] recently addressed the problem of resolving this paradox, and shed some new light on the complex and perplexing relationship between interpolation and generalisation. In the linear model setting, in the regime where the weights are the minimum norm solution to the least-squares problem, overfitting was proved to be benign under strong design assumptions (i.i.d. random matrices), in the asymptotic regime where the sample size and the dimension tend to infinity, see e.g. [8]. More precise results were recently obtained in the finite dimensional regime, specifically for linear models [1], [16], [9]. Non-linear settings have also been considered, such as the particular instance of kernel ridge regression studied in [3], [12] and [10], where interpolation was proved to coexist with good generalisation. More recent results in the non-linear direction include, e.g., [7] for shallow neural networks, which uses a very general framework combining the properties of gradient descent for a binary classification problem, but under somewhat restrictive assumptions on the dimension of the input. In this paper, we consider a non-linear statistical model, namely a model of the form
\[ y_i = f(x_i^\top \beta^*) + \varepsilon_i, \qquad i = 1, \dots, n, \tag{1} \]
where the $\varepsilon_i$ are noise terms and the function $f$ is assumed to be increasing. This setting is also known as the single index model or the ridge function model. When the estimation of $\beta^*$ is performed by Empirical Risk Minimisation, i.e. by solving
\[ \min_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, f(x_i^\top \beta)\big), \tag{2} \]
for a given smooth loss function $\ell$, we derive an upper bound on the prediction error that gives the precise order of dependence with respect to all the intrinsic parameters of the model, such as the sample size and the dimension, as well as various bounds on the derivatives of $f$ and of the loss function used in the Empirical Risk Minimisation.
Our contribution improves on the literature on the non-asymptotic regime of benign overfitting for non-linear models by providing concentration results on the generalisation error instead of controlling the expected risk only. More precisely, our results quantify the proximity, in Euclidean norm, of a certain solution of (2) to the true parameter $\beta^*$. Our proof of this first result utilises an elegant continuous Newton method argument initially promoted by Neuberger [5], [14]. From this, we obtain a quantitative analysis of the performance in prediction of the minimum $\ell_2$-norm solution, based on a careful study relying on high dimensional probability.
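To fix ideas, the following minimal sketch simulates the model (1) with a strictly increasing link and computes the minimum $\ell_2$-norm interpolator studied in Section 3 via the Moore–Penrose pseudo-inverse. The link $f(t) = t + \tanh(t)$, the Gaussian design, the dimensions and the noise level are our own illustrative choices and are not taken from the theorems below.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 200                        # overparametrised regime: p > n
beta_star = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))           # i.i.d. standard Gaussian (isotropic) design
eps = 0.1 * rng.normal(size=n)        # centred subGaussian noise

f = lambda t: t + np.tanh(t)          # strictly increasing, bijective on R

y = f(X @ beta_star) + eps

# Interpolating f(<x_i, beta>) = y_i is equivalent to <x_i, beta> = f^{-1}(y_i)
# because f is strictly monotonic; invert f by a damped fixed-point iteration,
# which is a contraction here since 1 <= f' <= 2.
def f_inverse(y, n_iter=60):
    t = y.copy()
    for _ in range(n_iter):
        t -= (f(t) - y) / 2.0
    return t

z = f_inverse(y)

# Minimum l2-norm solution of the under-determined linear system X beta = z.
beta_hat = np.linalg.pinv(X) @ z      # equivalently np.linalg.lstsq(X, z, rcond=None)[0]

print("max training residual:", np.max(np.abs(f(X @ beta_hat) - y)))   # ~ 0: interpolation
print("estimation error     :", np.linalg.norm(beta_hat - beta_star))
```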
2 Distance of the true model to the set of empirical risk minimisers
We study a non-isotropic model for the rows of the design matrix, and how the Cholesky factorisation of the covariance matrix can be leveraged to improve the prediction results. Let us now describe our mathematical model.
2.1 Statistical model
In this section, we make general assumptions on the feature vectors.
Assumption 1.
We assume that (1) holds and that the function $f$ and the loss function $\ell$ satisfy the following properties:
• $f$ is strictly monotonic,
• for some positive constant ,
• and is upper bounded by a constant .
Concerning the statistical data, we will make the following set of assumptions
Assumption 2.
We will require that
• the random column vectors with values in are independent random vectors, which can be written as for some matrix in , with at least non-zero singular values, and some independent subGaussian vectors in , with -norm upper bounded by . The vectors are such that the matrix
is full rank with probability one. We define ;
• for all , the random vectors are assumed
  – to have a second moment matrix equal to the identity, i.e.
  – to have -norm equal to (notice that this is different from the usual regression model, where the columns are assumed to be normalised).
Assumption 3.
The errors are assumed to be independent subGaussian centered random variables with -norm upper bounded by .
The performance of the estimators is measured by the theoretical risk given by
In order to estimate $\beta^*$, the Empirical Risk Minimiser is defined as a solution to the following optimisation problem
\[ \hat{\beta} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \widehat{R}_n(\beta), \tag{3} \]
with
\[ \widehat{R}_n(\beta) = \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, f(x_i^\top \beta)\big). \tag{4} \]
Let us make an important assumption for what is to follow
Assumption 4.
Assume that the loss function $\ell$ and the function $f$ are such that
(5) |
for all in , which is the -ball of radius centered at .
Let us check Assumption 4 in various simple cases.
• In the case of linear regression, we have , , and , , . Thus, we get .
• In the case of the Huber loss and the Softplus function, we have , , and , and . Moreover, . Then, when , we can take as long as for all and all . A numerical check of the corresponding derivative bounds is sketched right after this list.
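The sketch below numerically checks the boundedness of the first two derivatives of the Huber loss and of the Softplus link appearing in the second example; the Huber parameter and the evaluation grid are arbitrary choices, and the exact constants required by (5) are not reproduced.

```python
import numpy as np

delta = 1.0                                  # Huber parameter (illustrative choice)
t = np.linspace(-20, 20, 200001)             # evaluation grid

# First two derivatives of the Huber loss, as functions of the residual t.
huber_d1 = np.where(np.abs(t) <= delta, t, delta * np.sign(t))
huber_d2 = np.where(np.abs(t) <= delta, 1.0, 0.0)

# First two derivatives of the Softplus link f(t) = log(1 + exp(t)).
softplus_d1 = 1.0 / (1.0 + np.exp(-t))            # sigmoid, takes values in (0, 1)
softplus_d2 = softplus_d1 * (1.0 - softplus_d1)   # bounded by 1/4

print("sup |huber'|   :", np.abs(huber_d1).max(), "<= delta =", delta)
print("sup |huber''|  :", huber_d2.max(), "<= 1")
print("sup softplus'  :", softplus_d1.max(), "< 1")
print("sup softplus'' :", softplus_d2.max(), "<= 1/4")
```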
Notation: We denote by $\sigma_k$ the $k$-th singular value of the matrix , in decreasing order, i.e. $\sigma_1 \ge \sigma_2 \ge \cdots$. When $\sigma_n$ is non-zero, we denote by $\kappa_n = \sigma_1/\sigma_n$ its $n$-th condition number.
2.2 Distance to empirical risk minimisers via Neuberger’s theorem
Our main results in this section are the following two theorems. Their proofs are given in Section A.2 and Section A.3 of the appendix.
Theorem 2.1.
(Underparametrised setting)
Let and be positive constants depending only on the subGaussian norm . Let be a real in . Assume that and are such that and that Assumption 4 is verified. Let
(6) |
where , with a positive absolute constant.
Then, with probability at least , the unique solution to the optimisation problem (3) satisfies
where is the usual Euclidean norm.
Theorem 2.2.
(Overparametrised setting)
Remark 2.3.
Assume that (hence ) for simplicity. In order to illustrate the previous results, consider the case where the noise level is of the same order as , i.e. , and is independent of , as in the linear case. In the underparametrised case, this implies that the error bound on grows like . In the overparametrised case, the error bound given by Theorem 2.2 is of order . This is potentially much smaller than which is the natural order for in the absence of sparsity.
Note also that in the different setup of Candès and Plan [4], the rows of would have norm of the order of and the noise would have subgaussian constant independent of the dimensions and . Multiplying by , we get a norm of the ’s of the order and a subgaussian constant of the order for the noise. With this parametrisation of in mind, in the underparametrised case, the bound is of the order of and in the overparametrised case, the error bound given by Theorem 2.2 would be of the order .
2.3 The case of the linear models
2.3.1 The case of linear regression
Let us apply these theorems to the well-known linear regression model. In the linear case, the loss is quadratic and the optimisation problem (3) becomes
We therefore have:
Corollary 2.4 (Underparametrised case: linear model).
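As an informal companion to the linear-model corollaries, the following sketch compares the estimation error of the ordinary least-squares estimator when the sample size exceeds the dimension with that of the minimum $\ell_2$-norm solution in the opposite regime. The dimensions, noise level and normalisation of the true parameter are illustrative choices, not those appearing in the corollaries.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimation_error(n, p, sigma=0.1, n_rep=20):
    errs = []
    for _ in range(n_rep):
        beta_star = rng.normal(size=p) / np.sqrt(p)
        X = rng.normal(size=(n, p))
        y = X @ beta_star + sigma * rng.normal(size=n)
        # lstsq returns the ordinary least-squares solution when n >= p and the
        # minimum l2-norm solution when the system is under-determined (n < p).
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        errs.append(np.linalg.norm(beta_hat - beta_star))
    return float(np.mean(errs))

print("underparametrised (n=400, p=100):", estimation_error(400, 100))
print("overparametrised  (n=100, p=400):", estimation_error(100, 400))
```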
2.4 Discussion of the results
Our results are based on a new zero-finding approach inspired by [14], and we obtain precise quantitative results in the finite sample setting for linear and non-linear models. Following the initial discovery of the “double descent phenomenon” in [2], many authors have addressed the question of precisely characterising the error decay as a function of the number of parameters in the linear and non-linear settings (mostly based on random feature models). Some of the latest works, such as [11], address the problem in the asymptotic regime. Our results provide an explicit control of the distance between certain empirical estimators and the ground truth in terms of the subGaussian norms of the error and the covariance matrix of the covariate vectors. Our analysis employs efficient techniques from [14], combined with various results from Random Matrix Theory and high dimensional probability, as summarised in [17] and extensively discussed in [18]. Theorem 2.1 and Theorem 2.2 provide a new finite sample analysis of the problem of estimating ridge functions in both the underparametrised and overparametrised regimes, i.e. where the number of parameters is smaller (resp. larger) than the sample size.
• Our analysis of the underparametrised setting shows that we can obtain an error of order less than or equal to for all up to an arbitrary fraction of .
• In the overparametrised setting, we get that the error bound approaches the noise level plus as grows to . This term can be small if the rank of is small, and even more so if is concentrated in the null space of or approximately so, which is reminiscent of the results in [16] and the recent work [9].
Similar but simpler bounds also hold in the linear model setting, as presented in Corollary 2.4 and Corollary 2.5.
The following section addresses the problem of studying the prediction error, in probability, of the least $\ell_2$-norm empirical risk minimiser in the overparametrised setting.
3 Generalisation of least norm empirical risk minimisers
In this section, we leverage the results of the previous section in order to study the prediction error of the least $\ell_2$-norm empirical risk minimiser. Our main result is Theorem 3.1, stated in the next subsection.
3.1 Main result
Using the previous results, we obtain the following theorem about benign overfitting in the overparametrised case.
Theorem 3.1.
Assume that is interpolating, i.e. . Let denote the minimum norm interpolating solution to the ERM problem, i.e.
\[ \hat{\beta} \in \operatorname*{arg\,min} \Big\{ \|\beta\|_2 \ : \ f(x_i^\top \beta) = y_i, \ i = 1, \dots, n \Big\}. \tag{9} \]
Under the same assumptions as in Theorem 2.2, for any constant , and for
(10) |
with , where is a positive absolute constant, we have
(11) |
with probability at least
where and are absolute positive constants.
Proof.
See Section B. ∎
Remark 3.2.
For isotropic data, take . Notice that, in accordance with the concept of double descent as discussed in [2], when grows (while staying smaller than in the overparametrised regime), the ratio approaches from the right and the upper bound worsens.
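The qualitative behaviour described in Remark 3.2 can be reproduced numerically. The sketch below tracks the mean squared prediction error of the minimum $\ell_2$-norm interpolator in a linear model with a fixed number of parameters $p$, as the sample size $n$ approaches $p$ from below; the Gaussian design, noise level and dimensions are assumptions of ours, chosen only to illustrate the deterioration of the error near the interpolation threshold.

```python
import numpy as np

rng = np.random.default_rng(2)
p, sigma, n_test, n_rep = 200, 0.5, 1000, 20    # fixed number of parameters p

def prediction_error(n):
    errs = []
    for _ in range(n_rep):
        beta_star = rng.normal(size=p) / np.sqrt(p)
        X = rng.normal(size=(n, p))
        y = X @ beta_star + sigma * rng.normal(size=n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # min-norm interpolator (n < p)
        X_test = rng.normal(size=(n_test, p))
        errs.append(np.mean((X_test @ (beta_hat - beta_star)) ** 2))
    return float(np.mean(errs))

for n in (20, 50, 100, 150, 180, 195):
    print(f"n = {n:3d}, p/n = {p / n:5.2f}, mean squared prediction error = {prediction_error(n):.3f}")
```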
3.2 Comparison with previous results
Recently, the finite sample analysis of benign overfitting has been addressed in the very interesting works [1], [6] and [9], for the linear model only. These works propose very precise upper and lower bounds on the prediction risk for general covariate covariance matrices, under the subGaussian assumption or the more restrictive Gaussian assumption. In [1], the excess risk is considered and the norm of appears in the proposed upper bound instead of , which is potentially much smaller if has a large component in the kernel of or in spaces where the singular values of are small. The same issue arises in [6], where the noise is assumed neither Gaussian nor subGaussian and can be dependent on the design matrix . In [9], the authors propose a very strong bound that incorporates information more general than . However, the setting studied in [9] is restricted to linear models and Gaussian observation noise. Studies on non-linear models are scarce; a notable exception is [7], where the setting is very different from ours and the number of parameters is constrained in a more rigid fashion than in the present work.
In the present work, we show that similar, very precise results can be obtained for non-linear models of the ridge function class, using elementary perturbation results and some (now) standard random matrix theory. Moreover, our results concern the probability of exceeding a certain error level in predicting a new data point, whereas most results in the literature are restricted to the expected risk.
4 Conclusion and perspectives
This work presents a precise quantitative, finite sample analysis of the benign overfitting phenomenon in the estimation of linear and certain non-linear models. We make use of a zero-finding result of Neuberger [14] which can be applied to a large number of settings in machine learning.
Extending our work to the case of Deep Neural Networks is an exciting avenue for future research. Another possible direction is to include penalisation, which can be addressed using the same techniques via Karush-Kuhn-Tucker conditions. Applying this approach to Ridge Regression and -penalised estimation would bring new proof techniques into the field to investigate potentially deeper results. Weakening the assumptions on our data, which are here of subGaussian type, could also lead to interesting new results; this could be achieved by utilising, e.g. the work of [13].
References
- [1] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
- [2] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
- [3] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
- [4] Emmanuel J Candès and Yaniv Plan. Near-ideal model selection by $\ell_1$ minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.
- [5] Alfonso Castro and JW Neuberger. An inverse function theorem via continuous Newton's method. 2001.
- [6] Geoffrey Chinot and Matthieu Lerasle. Benign overfitting in the large deviation regime. arXiv preprint arXiv:2003.05838, 2020.
- [7] Spencer Frei, Niladri S Chatterji, and Peter Bartlett. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Conference on Learning Theory, pages 2668–2703. PMLR, 2022.
- [8] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
- [9] Guillaume Lecué and Zong Shang. A geometrical viewpoint on the benign overfitting property of the minimum $\ell_2$-norm interpolant estimator. arXiv preprint arXiv:2203.05873, 2022.
- [10] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. arXiv preprint arXiv:1808.00387, 2018.
- [11] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
- [12] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
- [13] Shahar Mendelson. Extending the small-ball method. arXiv preprint arXiv:1709.00843, 2017.
- [14] John W Neuberger. The continuous Newton's method, inverse functions, and Nash–Moser. The American Mathematical Monthly, 114(5):432–437, 2007.
- [15] Philippe Rigollet and Jan-Christian Hütter. High dimensional statistics.
- [16] Alexander Tsigler and Peter L Bartlett. Benign overfitting in ridge regression. arXiv preprint arXiv:2009.14286, 2020.
- [17] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- [18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
Appendix A Proofs of Theorem 2.1 and Theorem 2.2
A.1 Neuberger’s theorem
The following theorem of Neuberger [14] will be instrumental in our study of the ERM. In our context, this theorem can be restated as follows.
Theorem A.1 (Neuberger’s theorem for ERM).
Suppose that $r > 0$, that the gradient $\nabla \widehat{R}_n$ is differentiable and that its Jacobian $\nabla^2 \widehat{R}_n$ is a continuous map on the closed ball $\overline{B}(\beta^*, r)$, with the property that for each $\beta$ in $\overline{B}(\beta^*, r)$ there exists a vector $h$ in $\overline{B}(0, r)$ such that
\[ \nabla^2 \widehat{R}_n(\beta)\, h = - \nabla \widehat{R}_n(\beta^*). \tag{12} \]
Then there exists $\hat{\beta}$ in $\overline{B}(\beta^*, r)$ such that $\nabla \widehat{R}_n(\hat{\beta}) = 0$.
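For intuition, the following sketch illustrates the continuous Newton dynamics behind this zero-finding principle on an arbitrary smooth two-dimensional map (not the empirical-risk gradient of the paper): along the flow $\dot{x}(t) = -DF(x(t))^{-1} F(x(t))$ one has $F(x(t)) = e^{-t} F(x(0))$, so $F$ is driven to a zero, which is the mechanism exploited in the continuous Newton method of [14].

```python
import numpy as np

def F(x):                                  # an arbitrary smooth map R^2 -> R^2
    return np.array([x[0] + 0.3 * np.tanh(x[1]) - 1.0,
                     x[1] + 0.3 * np.sin(x[0]) + 0.5])

def DF(x):                                 # its Jacobian
    return np.array([[1.0, 0.3 / np.cosh(x[1]) ** 2],
                     [0.3 * np.cos(x[0]), 1.0]])

x = np.zeros(2)                            # starting point x(0)
dt, T = 1e-2, 10.0                         # explicit Euler discretisation of the flow
for _ in range(int(T / dt)):
    x = x - dt * np.linalg.solve(DF(x), F(x))

# Up to discretisation error, ||F(x(T))|| should match exp(-T) * ||F(x(0))||.
print("||F(x(T))||         :", np.linalg.norm(F(x)))
print("exp(-T) ||F(x(0))|| :", np.exp(-T) * np.linalg.norm(F(np.zeros(2))))
```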
A.1.1 Computing the derivatives
Since the loss is twice differentiable, the empirical risk is itself twice differentiable. The Gradient of the empirical risk is given by
(13) |
Hence, for , and recalling that , we get
(14) |
where is to be understood componentwise, and is a diagonal matrix with coefficients for all in .
The Hessian is given by
(15) |
where is a diagonal matrix given by, for all in
Notice that when the are all non-negative for all , the empirical risk is convex. The condition we have to satisfy in order to use Neuberger’s theorem, i.e. the version of (12) associated with our setting, is the following
A.1.2 A change of variable
Since
(16) |
and , the risk can be written as a function of as follows:
(17) |
with . Clearly, minimizing in is equivalent to minimizing in up to equivalence modulo the kernel of .
(18) | ||||
(19) |
(20) |
where is a diagonal matrix given by, for all in
The Neuberger equation in is now given by
(21) |
where .
A.1.3 A technical lemma
Lemma A.2.
For all , the variable is subGaussian with variance proxy
Proof.
Let us compute
Lipschitzianity of implies that
and since , we get
which implies that
Thus,
as announced. The proof is complete. ∎
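As a sanity check of the derivative computations of Section A.1.1, the following sketch implements the chain-rule gradient and Hessian of an empirical risk of the form $\frac{1}{n} \sum_{i} \ell\big(y_i, f(x_i^\top \beta)\big)$ and verifies them against finite differences. The quadratic loss and the link $f(t) = t + \tanh(t)$ are illustrative choices, and the notation is ours rather than that of (13)–(15).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

f    = lambda t: t + np.tanh(t)                  # illustrative increasing link
f_d1 = lambda t: 1.0 + 1.0 / np.cosh(t) ** 2
f_d2 = lambda t: -2.0 * np.tanh(t) / np.cosh(t) ** 2

loss_d1 = lambda y, v: v - y                     # derivatives of (v - y)^2 / 2 in v
loss_d2 = lambda y, v: np.ones_like(v)

def risk(beta):
    return 0.5 * np.mean((f(X @ beta) - y) ** 2)

def gradient(beta):
    u = X @ beta
    return X.T @ (loss_d1(y, f(u)) * f_d1(u)) / n

def hessian(beta):
    u = X @ beta
    d = loss_d2(y, f(u)) * f_d1(u) ** 2 + loss_d1(y, f(u)) * f_d2(u)
    return X.T @ (d[:, None] * X) / n

beta, eps = rng.normal(size=p), 1e-6
grad_fd = np.array([(risk(beta + eps * e) - risk(beta - eps * e)) / (2 * eps)
                    for e in np.eye(p)])
hess_fd = np.array([(gradient(beta + eps * e) - gradient(beta - eps * e)) / (2 * eps)
                    for e in np.eye(p)])
print("gradient check:", np.max(np.abs(gradient(beta) - grad_fd)))
print("hessian check :", np.max(np.abs(hessian(beta) - hess_fd)))
```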
A.2 Proof of Theorem 2.1: The underparametrised case
A.2.1 Key lemma
Using Corollary C.2, we have as long as
for some positive constant depending on the subGaussian norm of .
Lemma A.3.
With probability larger than or equal to we have
where .
Proof.
We compute the solution of Neuberger’s equation (21), using the Jacobian vector formula in (19) and the Hessian matrix formula in (20):
which gives
The singular value decomposition of gives , where is a matrix whose columns form an orthonormal family, is an orthogonal matrix and is a diagonal and invertible matrix. Thus, we obtain the equivalent equation
Note that any solution of the system
will be admissible. Hence,
Then, as ,
(22) |
Using maximal inequalities, we can now bound the deviation probability of
with . For this purpose, we first prove that is a subGaussian vector and provide an explicit upper bound on its norm. Since is a matrix whose columns form an orthonormal family, is a unit-norm vector of size which is denoted by . Then,
and since is centered and subGaussian for all , Proposition 2.6.1 in Vershynin [18] gives
for some absolute constant . Since
we have
As , we get that for all with ,
We deduce that is a subGaussian random vector with variance proxy
Using the maximal inequality from [15, Theorem 1.19] and the subGaussian properties described in [18, Proposition 2.5.2], for any , we get with probability
(23) |
Following Lemma A.2, we deduce that is a subGaussian random variable with variance proxy
Then, restricting a priori to lie in the ball , which is consistent with the assumptions of Neuberger's Theorem A.1, and using Assumption 4, together with the boundedness of , we get
Subsequently, the quantity (23) becomes
with probability . Taking , equation (22) yields
with probability . ∎
A.2.2 End of the proof of Theorem 2.1
A.3 Proof of Theorem 2.2: The overparametrised case
A.3.1 Key lemma
Using Theorem C.3, we have as long as
for some positive constant depending on the subGaussian norm of .
Lemma A.4.
With probability larger than or equal to , we have
where .
Proof.
As in the underparametrised case, we have to solve (12) i.e
which can be solved by finding the least norm solution of the interpolation problem
(24) |
i.e.
where is the Moore-Penrose pseudo-inverse of .
Given the compact SVD of , where and with orthonormal columns, we get
We then have
As for all ,
Then, using the bound
with probability , which can be obtained in the same way as (23) for the underparametrised case, and taking , we get
(25) |
with probability . ∎
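The least-norm solution used in this proof can be made concrete as follows: for a matrix with full row rank, the Moore–Penrose pseudo-inverse computed from the compact SVD coincides with the generic pseudo-inverse and returns the minimum $\ell_2$-norm solution of the interpolation system. The dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 100
A = rng.normal(size=(n, p))              # full row rank with probability one
b = rng.normal(size=n)

U, s, Vt = np.linalg.svd(A, full_matrices=False)     # compact SVD
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T               # Moore-Penrose pseudo-inverse

print("matches np.linalg.pinv:", np.allclose(A_pinv, np.linalg.pinv(A)))

h = A_pinv @ b
print("solves A h = b        :", np.allclose(A @ h, b))

# Any other solution h + v, with v a non-zero element of ker(A), has larger norm,
# since h lies in the row space of A and is therefore orthogonal to ker(A).
v = rng.normal(size=p)
v -= A_pinv @ (A @ v)                    # project v onto the kernel of A
print("min-norm property     :", np.linalg.norm(h) < np.linalg.norm(h + v))
```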
A.3.2 End of the proof of Theorem 2.2
Appendix B Proof of Theorem 3.1
B.1 First part of the proof
Since and are two interpolating minimisers of the empirical risk, i.e. satisfy and for all in , using that $f$ is strictly monotonic we get
(27) |
B.2 Second part of the proof
B.3 Third part of the proof: Block pseudo-inverse computation
Since is the minimum $\ell_2$-norm solution to the under-determined system
then we have
This gives
(30) |
since
and
B.4 Fourth part of the proof
B.5 Fifth part of the proof
In this section, we control each term on the RHS of (31).
Control of
We have
(32) |
Control of
Let us compute an upper bound on the norm of which holds with large probability. We have
Our main tool will be the following result from Vershynin’s book [18, Exercise 6.3.5]
(33) |
where and are absolute positive constants.
Let us introduce the Singular Value Decomposition of
with a matrix, where is the rank of . We then have
Then we have
and since for some matrix :
where denotes the largest singular value. Since the columns of form an othornomal family, we get
Hence
Since is symmetric, we have
Define the event
Let us define
then
Notice that
which gives
which finally yields
Let us turn to the study of . Conditioning on , we have
On the other hand, the subGaussianity of ,
which gives
Then
with the -th condition number.
And since
we get
Choosing for some , we get
(35) | ||||
(36) |
B.6 Final step of the proof
with probability at least
(37) |
Using -Lipschitzianity of , we obtain
with probability at least
(38) |
which is the desired result.
Appendix C Classical bounds on the extreme singular values of finite dimensional random matrices
C.1 Random matrices with independent rows
Recall that the matrix is composed of i.i.d. subGaussian random vectors in , with . In the underparametrised case, we have . Let us recall the following bound on the singular values of a matrix with independent subGaussian rows.
Theorem C.1.
[17, Theorem 5.39] Let $A$ be an $N \times n$ matrix whose rows $A_i$, $i = 1, \dots, N$, are independent subGaussian isotropic random vectors in $\mathbb{R}^n$. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-c t^2)$, one has
\[ \sqrt{N} - C\sqrt{n} - t \ \le\ s_{\min}(A) \ \le\ s_{\max}(A) \ \le\ \sqrt{N} + C\sqrt{n} + t, \]
where $C$ and $c$ depend only on the subGaussian norm $K = \max_i \|A_i\|_{\psi_2}$ of the rows.
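A quick empirical illustration of this bound, in the special case of standard Gaussian rows (one instance of subGaussian isotropic rows) and arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
N, n = 4000, 400
A = rng.normal(size=(N, n))              # independent isotropic (Gaussian) rows
s = np.linalg.svd(A, compute_uv=False)

print("sqrt(N) - sqrt(n) =", np.sqrt(N) - np.sqrt(n))
print("s_min(A)          =", s.min())
print("s_max(A)          =", s.max())
print("sqrt(N) + sqrt(n) =", np.sqrt(N) + np.sqrt(n))
```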
In the main text, we use the following corollary.
Corollary C.2.
Let us suppose that with . Under the same assumptions as in Theorem C.1, with probability equal to or larger than
C.2 Random matrices with independent columns
In the overparametrised case, the following theorem of Vershynin will be instrumental.
Theorem C.3.
[17, Theorem 5.58] Let $A$ be an $N \times n$ matrix ($N \ge n$) whose columns $A_j$ are independent subGaussian isotropic random vectors in $\mathbb{R}^N$ with $\|A_j\|_2 = \sqrt{N}$ almost surely. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-c t^2)$, one has
\[ \sqrt{N} - C\sqrt{n} - t \ \le\ s_{\min}(A) \ \le\ s_{\max}(A) \ \le\ \sqrt{N} + C\sqrt{n} + t, \]
where $C$ and $c$ depend only on the subGaussian norm $K = \max_j \|A_j\|_{\psi_2}$ of the columns.