
A finite sample analysis of generalisation for minimum norm interpolators in the ridge function model with random design

Emmanuel Caron Laboratoire de Mathématiques d’Avignon (LMA), Université d’Avignon Campus Jean-Henri Fabre 301, rue Baruch de Spinoza, BP 21239 F-84 916 AVIGNON Cedex 9. Most of this work was done while the author was with the ERIC Laboratory, University of Lyon 2, Lyon, France    Stéphane Chrétien ERIC Laboratory and UFR ASSP, University of Lyon 2, 5 avenue Mendes France, 69676 Bron Cedex. He is invited researcher at the Mathematical Institute, University of Oxford, Oxford OX2 6GG, UK, the National Physical Laboratory, 1 Hampton Road, Teddington, TW11 0LW, UK, The Alan Turing Institute, British Library, 96 Euston Road, London NW1 2DB, UK, and FEMTO-ST Institute/Optics Department, 15B Avenue des Montboucons, F-25030 Besançon, France.
Abstract

Recent extensive numerical experiments in large scale machine learning have uncovered a quite counterintuitive phase transition, as a function of the ratio between the sample size and the number of parameters in the model. As the number of parameters $p$ approaches the sample size $n$, the generalisation error increases, but surprisingly, it starts decreasing again past the threshold $p=n$. This phenomenon, brought to the theoretical community's attention in [2], has been thoroughly investigated lately, more specifically for simpler models than deep neural networks, such as the linear model when the parameter is taken to be the minimum norm solution to the least-squares problem, firstly in the asymptotic regime when $p$ and $n$ tend to infinity, see e.g. [8], and recently in the finite dimensional regime, more specifically for linear models [1], [16], [9]. In the present paper, we propose a finite sample analysis of non-linear models of ridge type, where we investigate the overparametrised regime of the double descent phenomenon for both the estimation problem and the prediction problem. Our results provide a precise analysis of the distance of the best estimator from the true parameter as well as a generalisation bound which complements recent works of [1] and [6]. Our analysis is based on tools closely related to the continuous Newton method [14] and a refined quantitative analysis of the performance in prediction of the minimum $\ell_{2}$-norm solution.

1 Introduction

The tremendous achievements in deep learning theory and practice have received great attention in the applied Computer Science, Artificial Intelligence and Statistics communities in recent years. Many success stories related to the use of Deep Neural Networks have even been reported in the media, and most data scientists are proficient with the Deep Learning tools available via open-source machine learning libraries such as Tensorflow, Keras, Pytorch and many others.

One of the key ingredients in their success is the huge number of parameters involved in all current architectures. While enabling unprecedented expressivity, such regimes of overparametrisation appear very counterintuitive through the lens of traditional statistical wisdom. Indeed, as intuition suggests, overparametrisation often results in interpolation, i.e. zero training error, and the expected outcome should be very poor generalisation performance. Nevertheless, interpolating deep neural networks still generalise well despite achieving zero or very small training error.

Belkin et al. [2] recently addressed the problem of resolving this paradox, and brought some new light on the complex and perplexing relationship between interpolation and generalisation. In the linear model setting, in the regime where the weights are the minimum norm solution to the least-squares problem, overfitting was proved to be benign under strong design assumptions (i.i.d. random matrices), when $p$ and $n$ tend to infinity, see e.g. [8]. More precise results were recently obtained in the finite dimensional regime, more specifically for linear models [1], [16], [9]. Non-linear settings have also been considered: in the particular instance of kernel ridge regression, [3], [12], [10] proved that interpolation can coexist with good generalisation. More recent results in the nonlinear direction include, e.g. [7] for shallow neural networks, which uses a very general framework combining the properties of gradient descent for a binary classification problem, but under somewhat restrictive assumptions on the dimension of the input, of the form $n\|\mu\|^{4}\gg p\geq C\max\left\{n\|\mu\|^{2},n^{2}\log(n/\delta)\right\}$. In this paper, we consider a statistical model where $\phi(X;\theta)=f(X^{\top}\theta)$, i.e. a model of the form

\mathbb{E}[Y_{i}\mid X_{i}] = f(X_{i}^{\top}\theta^{\ast}),\qquad i=1,\ldots,n, \qquad (1)

where $\theta^{*}\in\mathbb{R}^{p}$ and the function $f$ is assumed increasing. This setting is also known as the single index model or ridge function model. When the estimation of $\theta^{*}$ is performed using Empirical Risk Minimisation, i.e. by solving

\hat{\theta} = \mathrm{argmin}_{\theta\in\Theta}\ \frac{1}{n}\sum_{i=1}^{n}\ell(Y_{i}-f(X_{i}^{\top}\theta)) \qquad (2)

for a given smooth loss function $\ell$, we derive an upper bound on the prediction error that gives precise orders of dependency with respect to all the intrinsic parameters of the model, such as the dimensions $n$ and $p$, as well as various bounds on the derivatives of $f$ and of the loss function used in the Empirical Risk Minimisation.
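As an informal illustration of this setting, the following minimal Python sketch simulates data from model (1) with a logistic ridge function and minimises the empirical risk (2) with the squared loss by plain gradient descent; the specific choices of $f$, $\ell$, dimensions and step size are ours, for illustration only, and are not prescribed by the analysis below.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 200, 50                                   # sample size and number of parameters (illustrative)
    theta_star = rng.normal(size=p) / np.sqrt(p)
    X = rng.normal(size=(n, p))                      # design matrix
    f = lambda z: 1.0 / (1.0 + np.exp(-z))           # an increasing ridge function (logistic, illustrative)
    Y = f(X @ theta_star) + 0.05 * rng.normal(size=n)  # model (1) plus centred subGaussian noise

    # Empirical risk (2) with the squared loss l(u) = u^2 / 2, minimised by gradient descent.
    theta = np.zeros(p)
    for _ in range(5000):
        z = X @ theta
        resid = Y - f(z)
        grad = -(X.T @ (resid * f(z) * (1 - f(z)))) / n   # gradient of (2), using f'(z) = f(z)(1 - f(z))
        theta -= 1.0 * grad

    print("estimation error:", np.linalg.norm(theta - theta_star))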

Our contribution improves on the literature on the non-asymptotic regime of benign overfitting for non-linear models by providing concentration results on generalisation instead of controlling the expected risk only. More precisely, our results quantify the proximity in $\ell_{2}$ of a certain solution $\hat{\theta}$ of (2) to $\theta^{*}$. Our proof of this first result utilises an elegant continuous Newton method argument initially promoted by Neuberger [5], [14]. From this, we obtain a quantitative analysis of the performance in prediction of the minimum $\ell_{2}$-norm solution, based on a careful study relying on high dimensional probability.

2 Distance of the true model to the set of empirical risk minimisers

We study a non-isotropic model for the rows $X_{i}^{\top}$ of the design matrix, and how the Cholesky factorisation of the covariance matrix can be leveraged to improve the prediction results. Let us now describe our mathematical model.

2.1 Statistical model

In this section, we make general assumptions on the feature vectors.

Assumption 1.

We assume that (1) holds and that $f$ and the loss function $\ell:\mathbb{R}\mapsto\mathbb{R}$ satisfy the following properties:

  • $f$ is strictly monotonic,

  • $f^{\prime}(z)\leq C_{f^{\prime}}$ for some positive constant $C_{f^{\prime}}$,

  • $\ell^{\prime}(0)=0$ and $\ell^{\prime\prime}$ is upper bounded by a constant $C_{\ell^{\prime\prime}}>0$.

Concerning the statistical data, we will make the following set of assumptions

Assumption 2.

We will require that

  • the random column vectors $X_{1},\ldots,X_{n}$ with values in $\mathbb{R}^{p}$ are independent random vectors, which can be written as $AZ_{i}$ for some matrix $A$ in $\mathbb{R}^{p\times p}$, with at least $n$ non-zero singular values, and some independent subGaussian vectors $Z_{1},\ldots,Z_{n}$ in $\mathbb{R}^{p}$, with $\psi_{2}$-norm upper bounded by $K_{Z}$. The vectors $Z_{i}$ are such that the matrix

    $Z^{\top} = \begin{bmatrix} Z_{1},\ldots,Z_{n} \end{bmatrix}$

    is full rank with probability one. We define $X=ZA^{\top}$;

  • for all $i=1,\ldots,n$, the random vectors $Z_{i}$ are assumed

    • to have a second moment matrix equal to the identity, i.e. $\mathbb{E}[Z_{i}Z_{i}^{\top}]=I_{p}$,

    • to have $\ell_{2}$-norm equal to $\sqrt{p}$ (note that this is different from the usual regression model, where the columns are assumed to be normalised).

Assumption 3.

The errors $\varepsilon_{i}=Y_{i}-\mathbb{E}[Y_{i}\mid X_{i}]$ are assumed to be independent subGaussian centered random variables with $\psi_{2}$-norm upper bounded by $K_{\varepsilon}$.
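As a concrete, hypothetical instance of Assumptions 2 and 3, the sketch below builds $Z$ with independent Rademacher entries, which simultaneously gives $\mathbb{E}[Z_{i}Z_{i}^{\top}]=I_{p}$ and $\|Z_{i}\|_{2}=\sqrt{p}$, and then forms the design $X=ZA^{\top}$ for an arbitrary full-rank $A$; the Gaussian noise is one example of a subGaussian error. The dimensions and scalings are illustrative choices of ours.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 100, 300                           # an overparametrised example (illustrative)

    # Assumption 2: Rademacher rows satisfy E[Z_i Z_i^T] = I_p and ||Z_i||_2 = sqrt(p).
    Z = rng.choice([-1.0, 1.0], size=(n, p))
    A = rng.normal(size=(p, p)) / np.sqrt(p)  # a matrix with (almost surely) p non-zero singular values
    X = Z @ A.T                               # design matrix X = Z A^T

    # Assumption 3: independent centred subGaussian noise.
    eps = 0.1 * rng.normal(size=n)

    assert np.allclose(np.linalg.norm(Z, axis=1), np.sqrt(p))
    print("smallest singular value of Z^T:", np.linalg.svd(Z.T, compute_uv=False)[-1])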

The performance of the estimators is measured by the theoretical risk $R:\Theta\mapsto\mathbb{R}$ defined by

R(\theta)=\mathbb{E}\Big[\ell(Y-f(X^{\top}\theta))\Big].

In order to estimate $\theta^{*}$, the Empirical Risk Minimiser $\hat{\theta}$ is defined as a solution to the following optimisation problem

\hat{\theta} \in \mathrm{argmin}_{\theta\in\Theta}\ \hat{R}_{n}(\theta), \qquad (3)

with

\hat{R}_{n}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(Y_{i}-f(X_{i}^{\top}\theta)). \qquad (4)

Let us make an important assumption for what is to follow

Assumption 4.

Assume that the loss function $\ell$ and the function $f$ are such that

|\ell^{\prime\prime}(Y_{i}-f(X_{i}^{\top}\theta))\ f^{\prime}(X_{i}^{\top}\theta)^{2}-\ell^{\prime}(Y_{i}-f(X_{i}^{\top}\theta))\ f^{\prime\prime}(X_{i}^{\top}\theta)| \geq \delta > 0 \qquad (5)

for all $\theta$ in $\mathcal{B}_{2}(\theta^{*},r)$, the $\ell_{2}$-ball of radius $r$ centered at $\theta^{*}$.

Let us check Assumption 4 in various simple cases.

  • In the case of linear regression, we have $\ell(z)=\frac{1}{2}z^{2}$, $\ell^{\prime}(z)=z$, $\ell^{\prime\prime}(z)=1$ and $f(z)=z$, $f^{\prime}(z)=1$, $f^{\prime\prime}(z)=0$. Thus, we get $\delta=1$.

  • In the case of the Huber loss and the sigmoid function, we have
    $\ell(z)=\begin{cases}\frac{1}{2}z^{2} & \text{for }|z|\leq\eta\\ \eta\left(|z|-\frac{1}{2}\eta\right) & \text{otherwise}\end{cases}$,  $\ell^{\prime}(z)=\begin{cases}z & \text{for }|z|\leq\eta\\ \eta & \text{if }z>\eta\\ -\eta & \text{if }z<-\eta\end{cases}$,  $\ell^{\prime\prime}(z)=\begin{cases}1 & \text{for }|z|\leq\eta\\ 0 & \text{if }|z|>\eta\end{cases}$
    and $f(z)=1/(1+\exp(-z))$, $f^{\prime}(z)=\exp(-z)/(1+\exp(-z))^{2}$ and $f^{\prime\prime}(z)=\frac{2e^{-2z}}{\left(e^{-z}+1\right)^{3}}-\frac{e^{-z}}{\left(e^{-z}+1\right)^{2}}$.
    Moreover, $f^{\prime}(z)/f^{\prime\prime}(z)>1$. Then, when $\eta<1$, we can take $\delta=1-\eta$ as long as $|Y_{i}-f(X_{i}^{\top}\theta)|\leq\eta$ for all $i=1,\ldots,n$ and all $\theta\in\mathcal{B}_{2}(\theta^{*},r)$; a generic numerical evaluation of the quantity in (5) is sketched below.
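The following sketch evaluates the left-hand side of (5) for user-supplied derivatives of $\ell$ and $f$, and confirms the value $\delta=1$ in the linear regression case above; the helper name and the parameter grid are ours and purely illustrative, not part of the paper's argument.

    import numpy as np

    def assumption4_quantity(resid, u, loss_d1, loss_d2, f_d1, f_d2):
        """|l''(resid) f'(u)^2 - l'(resid) f''(u)|, the left-hand side of (5)."""
        return np.abs(loss_d2(resid) * f_d1(u) ** 2 - loss_d1(resid) * f_d2(u))

    # Linear regression: l(z) = z^2 / 2 and f(z) = z, so the quantity is identically 1.
    resid = np.linspace(-5, 5, 101)
    u = np.linspace(-5, 5, 101)
    R, U = np.meshgrid(resid, u)
    q = assumption4_quantity(
        R, U,
        loss_d1=lambda z: z, loss_d2=lambda z: np.ones_like(z),             # l' and l''
        f_d1=lambda z: np.ones_like(z), f_d2=lambda z: np.zeros_like(z),    # f' and f''
    )
    print("min over the grid (should be 1):", q.min())   # delta = 1 in the linear case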

Notation: We denote by $s_{i}(A)$ the $i$-th singular value of the matrix $A$, in decreasing order, i.e. $s_{1}(A)\geq s_{2}(A)\geq\ldots\geq s_{n}(A)$. When $s_{n}(A)$ is non-zero, we denote by $\kappa_{n}(A)=\frac{s_{1}(A)}{s_{n}(A)}$ the $n$-th condition number of $A$.

2.2 Distance to empirical risk minimisers via Neuberger’s theorem

Our main results in this section are the following two theorems. Their proofs are given in Section A.2 and Section A.3 in the appendix.

Theorem 2.1.

(Underparametrised setting)

Let $C_{K_{Z}}$ and $c_{K_{Z}}$ be positive constants depending only on the subGaussian norm $K_{Z}$. Let $\alpha$ be a real number in $]0,1[$. Assume that $p$ and $n$ are such that $C_{K_{Z}}^{2}p<(1-\alpha)^{2}n$ and that Assumption 4 is verified. Let

r = \frac{C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}\sqrt{p}}{\delta\left((1-\alpha)\sqrt{n}-C_{K_{Z}}\sqrt{p}\right)}, \qquad (6)

where $C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}=6\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}$, with $C$ a positive absolute constant.

Then, with probability at least $1-2\exp(-c_{K_{Z}}\alpha^{2}n)-\exp(-p/2)$, the unique solution $\hat{\theta}$ to the optimisation problem (3) satisfies

\|A^{\top}(\hat{\theta}-\theta^{*})\|_{2} \leq r,

where $\|\cdot\|_{2}$ is the usual Euclidean norm.

Theorem 2.2.

(Overparametrised setting)

Assume that $p$ and $n$ are such that $C_{K_{Z}}^{2}\,n<(1-\alpha)^{2}p$ and that Assumption 4 is verified. Let

r = \frac{C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}. \qquad (7)

Then, there exists a first order stationary point $\hat{\theta}$ of the optimisation problem (3) such that, with probability larger than or equal to $1-2\exp(-c_{K_{Z}}\alpha^{2}p)-\exp(-n/2)$, we have

\|A^{\top}(\hat{\theta}-\theta^{*})\|_{2} \leq r. \qquad (8)
Remark 2.3.

Assume that $A=I$ (hence $X=Z$) for simplicity. In order to illustrate the previous results, consider the case where the noise level is of the same order as $\|X_{i}\|_{2}$, i.e. $K_{\varepsilon}\sim\sqrt{p}$, and $\delta$ is independent of $r$, as in the linear case. In the underparametrised case, this implies that the error bound on $\|\hat{\theta}-\theta^{*}\|_{2}$ grows like $p/\sqrt{n}$. In the overparametrised case, the error bound given by Theorem 2.2 is of order $\sqrt{n}$. This is potentially much smaller than $\sqrt{p}$, which is the natural order for $\|\theta^{*}\|$ in the absence of sparsity.

Note also that in the different setup of Candès and Plan [4], the rows of $X$ would have norm of the order of $\sqrt{p/n}$ and the noise would have subGaussian constant $K_{\varepsilon}$ independent of the dimensions $n$ and $p$. Multiplying by $\sqrt{n}$, we get a norm of the $X_{i}$'s of the order $\sqrt{p}$ and a subGaussian constant $K_{\varepsilon}$ of the order $\sqrt{n}$ for the noise. With this parametrisation of $K_{\varepsilon}$ in mind, in the underparametrised case, the bound is of the order of $\sqrt{p}$ and in the overparametrised case, the error bound given by Theorem 2.2 would be of the order $n/\sqrt{p}$.
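To visualise the behaviour described in Remark 2.3 in the simplest linear, isotropic case ($f(z)=z$, $A=I$), the following sketch computes the estimation error of the minimum $\ell_{2}$-norm least-squares solution as $p$ varies across $p=n$. The experimental design (dimensions, noise level) is an illustrative choice of ours; it only exhibits the underparametrised/overparametrised transition and does not reproduce the theorems' constants.

    import numpy as np

    rng = np.random.default_rng(2)
    n, sigma = 100, 0.5
    errors = {}
    for p in [20, 50, 80, 95, 105, 150, 300, 600, 1200]:
        theta_star = rng.normal(size=p) / np.sqrt(p)
        X = rng.normal(size=(n, p))                      # isotropic design, A = I
        y = X @ theta_star + sigma * rng.normal(size=n)
        theta_hat = np.linalg.pinv(X) @ y                # minimum l2-norm least-squares solution
        errors[p] = np.linalg.norm(theta_hat - theta_star)

    for p, err in errors.items():
        print(f"p = {p:5d}   ||theta_hat - theta*||_2 = {err:.3f}")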

2.3 The case of the linear models

2.3.1 The case of linear regression

Let us apply these theorems to the well-known linear regression model. In the linear case $f(X_{i}^{\top}h)=X_{i}^{\top}h$, the loss $\ell(z)=\frac{1}{2}z^{2}$ is quadratic and the optimisation problem (3) is

\hat{\theta}=\mathrm{argmin}_{\theta\in\mathbb{R}^{p}}\ \frac{1}{2n}\sum_{i=1}^{n}(Y_{i}-X_{i}^{\top}\theta)^{2}.

We therefore have:

Corollary 2.4 (Underparametrised case: linear model).

Let

r = \frac{C_{\varepsilon}\sqrt{p}}{(1-\alpha)\sqrt{n}-C_{K_{Z}}\sqrt{p}},

where $C_{\varepsilon}=6\sqrt{C}K_{\varepsilon}$. Under the same assumptions as Theorem 2.1, the unique solution $\hat{\theta}$ to the optimisation problem (3) satisfies

\|A^{\top}(\hat{\theta}-\theta^{*})\|_{2} \leq r,

with probability larger than or equal to $1-2\exp(-c_{K_{Z}}\alpha^{2}n)-\exp(-p/2)$.

Proof.

Apply Theorem 2.1 with $f(X_{i}^{\top}h)=X_{i}^{\top}h$ and $\ell(y)=\frac{1}{2}y^{2}$. Then we have

  • $\ell^{\prime}(y)=y$ and $\ell^{\prime\prime}(y)=1$, hence $C_{\ell^{(k)}}=0$ for $k\geq 3$;

  • $f(X_{i}^{\top}h)=X_{i}^{\top}h$, hence $C_{f^{\prime}}=1$ and $C_{f^{(k)}}=0$ for $k\geq 2$.

Consequently, we have $\delta=1$ in the linear case and Assumption 4 is verified. Furthermore, the constant $C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}$ simplifies to $6\sqrt{C}K_{\varepsilon}$, which is denoted by $C_{\varepsilon}$.

This concludes the proof. ∎

Corollary 2.5 (Overparametrised case: linear model).

Let

r = \frac{C_{\varepsilon}\sqrt{n}}{(1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}},

where $C_{\varepsilon}=6\sqrt{C}K_{\varepsilon}$. Under the same assumptions as Theorem 2.2, there exists a solution $\hat{\theta}$ to the optimisation problem (3) such that, with probability larger than or equal to $1-2\exp(-c_{K_{Z}}\alpha^{2}p)-\exp(-n/2)$, we have

\|A^{\top}(\hat{\theta}-\theta^{*})\|_{2} \leq r.

2.4 Discussion of the results

Our results are based on a new zero-finding approach inspired by [14], and we obtain precise quantitative results in the finite sample setting for linear and non-linear models. Following the initial discovery of the "double descent phenomenon" in [2], many authors have addressed the question of precisely characterising the error decay as a function of the number of parameters in the linear and non-linear settings (mostly based on random feature models). Some of the latest works, such as [11], address the problem in the asymptotic regime. Our results provide an explicit control of the distance between some empirical estimators and the ground truth in terms of the subGaussian norms of the error and the covariance matrix of the covariate vectors. Our analysis employs efficient techniques from [14], combined with various results from Random Matrix Theory and high dimensional probability, as summarised in [17] and extensively discussed in [18]. Theorem 2.1 and Theorem 2.2 provide a new finite sample analysis of the problem of estimating ridge functions in both the underparametrised and overparametrised regimes, i.e. where the number of parameters is smaller (resp. larger) than the sample size.

  • Our analysis of the underparametrised setting shows that we can obtain an error of order less than or equal to $\sqrt{p/n}$ for all $p$ up to an arbitrary fraction of $n$.

  • In the overparametrised setting, we get that the error bound approaches the noise level plus $t\,K_{Z}\|A^{\top}\theta^{*}\|_{2}$ as $p$ grows to $+\infty$. This term can be small if the rank of $A$ is small, and even more so if $\theta^{*}$ is concentrated in the null space of $A$, or approximately so, which is reminiscent of the results in [16] and the recent work [9].

Similar but simpler bounds also hold in the linear model setting, as presented in Corollary 2.4 and Corollary 2.5.

The following section addresses the problem of studying the prediction error in probability of the least $\ell_{2}$-norm empirical risk minimiser in the overparametrised setting.

3 Generalisation of least norm empirical risk minimisers

In this section, we leverage the results of the former section in order to study the prediction error of the least $\ell_{2}$-norm empirical risk minimiser. Our main result is Theorem 3.1, stated in the next subsection.

3.1 Main result

Using the previous results, we obtain the following theorem about benign overfitting in the overparametrised case.

Theorem 3.1.

Assume that $\hat{\theta}$ is interpolating, i.e. $\hat{R}_{n}(\hat{\theta})=0$. Let $\hat{\theta}^{\circ}$ denote the minimum norm interpolating solution to the ERM problem, i.e.

\text{argmin}_{\theta}\ \|\theta\|_{2}\quad\text{subject to }\hat{R}_{n}(\theta)=0. \qquad (9)

Under the same assumptions as in Theorem 2.2, for any constant $\tau>0$, and for

r = \frac{C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}, \qquad (10)

with $C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}=6\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}$, where $C$ is a positive absolute constant, we have

|f(X_{n+1}^{\top}\hat{\theta}^{\circ})-f(X_{n+1}^{\top}\theta^{*})| \leq C_{f^{\prime}}\left(\left|\varepsilon_{n+1}\right|+t\,K_{Z}\|A^{\top}\theta^{*}\|_{2}+\sqrt{\tau(n/p)\log(n)}+t\,\frac{C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\right) \qquad (11)

with probability at least

1-2\exp(-c_{K_{Z}}\alpha^{2}p)-\exp\left(-\frac{n}{2}\right)-\exp(-c_{K_{Z}}t^{2})-\exp\left(-\frac{t^{2}}{2}\right)-4\exp\left(-c_{K_{Z}}\alpha p\right)-\exp\left(-cC^{\prime 2}n\right)-2\exp\left(-\frac{\tau\log(n)\left(1-\alpha-C_{K_{Z}}\sqrt{n/p}\right)^{2}}{8K_{Z}^{2}C^{\prime 2}K_{\varepsilon}^{2}\kappa_{n}(A)^{2}}\right),

where $c$ and $C^{\prime}$ are absolute positive constants.

Proof.

See Section B. ∎

Remark 3.2.

For isotropic data, take $A=I_{p}$. Notice that, in accordance with the concept of double descent as discussed in [2], when $n$ grows (while staying smaller than $p$ in the overparametrised regime), the ratio $n/p$ approaches $1$ from below and the upper bound worsens.
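As an informal illustration of Theorem 3.1 in the linear, isotropic case ($f(z)=z$, $A=I_{p}$), the sketch below forms the minimum $\ell_{2}$-norm interpolator $\hat{\theta}^{\circ}=X^{\dagger}y$ of (9) and evaluates the prediction error at a fresh point $X_{n+1}$; the dimensions and noise level are illustrative choices of ours.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, sigma = 100, 400, 0.1                       # overparametrised regime, n < p
    theta_star = rng.normal(size=p) / np.sqrt(p)
    X = rng.normal(size=(n, p))                       # isotropic design (A = I_p)
    y = X @ theta_star + sigma * rng.normal(size=n)

    theta_circ = np.linalg.pinv(X) @ y                # minimum l2-norm interpolator (9), linear case
    assert np.allclose(X @ theta_circ, y)             # interpolation: zero training error

    x_new = rng.normal(size=p)                        # fresh covariate X_{n+1}
    pred_error = abs(x_new @ theta_circ - x_new @ theta_star)
    print("prediction error at a new point:", pred_error)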

3.2 Comparison with previous results

Recently, the finite sample analysis has been addressed in the very interesting works [1], [6] and [9], for the linear model only. These works propose very precise upper and lower bounds on the prediction risk for general covariate covariance matrices under the subGaussian assumption or the more restrictive Gaussian assumption. In [1], excess risk is considered and the norm $\|\theta^{*}\|$ appears in the proposed upper bound instead of $\|A^{\top}\theta^{*}\|_{2}$, which is potentially much smaller if $\theta^{*}$ has a large component in the kernel of $A$ or in spaces where the singular values of $A$ are small. The same issue arises in [6], where the noise is not assumed Gaussian nor subGaussian and can be dependent on the design matrix $X$. In [9], the authors propose a very strong bound that incorporates information more general than $\|A^{\top}\theta^{*}\|_{2}$. However, the setting studied in [9] is restricted to linear models and Gaussian observation noise. Studies on nonlinear models are scarce, a notable exception being [7], where the setting is very different from ours and the number of parameters is constrained in a more rigid fashion than in the present work.

In the present work, we show that similar, very precise results can be obtained for non-linear models of the ridge function class, using elementary perturbation results and some (now) standard random matrix theory. Moreover, our results concern the probability of exceeding a certain error level in predicting a new data point, whereas most results in the literature are restricted to the expected risk.

4 Conclusion and perspectives

This work presents a precise quantitative, finite sample analysis of the benign overfitting phenomenon in the estimation of linear and certain non-linear models. We make use of a zero-finding result of Neuberger [14] which can be applied to a large number of settings in machine learning.

Extending our work to the case of Deep Neural Networks is an exciting avenue for future research. Another possible direction is to include penalisation, which can be addressed using the same techniques via Karush-Kuhn-Tucker conditions. Applying this approach to Ridge Regression and $\ell_{1}$-penalised estimation would bring new proof techniques into the field to investigate potentially deeper results. Weakening the assumptions on our data, which are here of subGaussian type, could also lead to interesting new results; this could be achieved by utilising, e.g., the work of [13].

References

  • [1] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
  • [2] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • [3] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
  • [4] Emmanuel J. Candès and Yaniv Plan. Near-ideal model selection by $\ell_{1}$ minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.
  • [5] Alfonso Castro and J. W. Neuberger. An inverse function theorem via continuous Newton's method. 2001.
  • [6] Geoffrey Chinot and Matthieu Lerasle. Benign overfitting in the large deviation regime. arXiv preprint arXiv:2003.05838, 2020.
  • [7] Spencer Frei, Niladri S. Chatterji, and Peter Bartlett. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Conference on Learning Theory, pages 2668–2703. PMLR, 2022.
  • [8] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
  • [9] Guillaume Lecué and Zong Shang. A geometrical viewpoint on the benign overfitting property of the minimum $\ell_{2}$-norm interpolant estimator. arXiv preprint arXiv:2203.05873, 2022.
  • [10] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv preprint arXiv:1808.00387, 2018.
  • [11] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
  • [12] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
  • [13] Shahar Mendelson. Extending the small-ball method. arXiv preprint arXiv:1709.00843, 2017.
  • [14] John W. Neuberger. The continuous Newton's method, inverse functions, and Nash-Moser. The American Mathematical Monthly, 114(5):432–437, 2007.
  • [15] Philippe Rigollet and Jan-Christian Hütter. High dimensional statistics.
  • [16] Alexander Tsigler and Peter L. Bartlett. Benign overfitting in ridge regression. arXiv preprint arXiv:2009.14286, 2020.
  • [17] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
  • [18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Appendix A Proofs of Theorem 2.1 and Theorem 2.2

A.1 Neuberger’s theorem

The following theorem of Neuberger [14] will be instrumental in our study of the ERM. In our context, this theorem can be restated as follows.

Theorem A.1 (Neuberger’s theorem for ERM).

Suppose that $r>0$, that $\theta^{\ast}\in\mathbb{R}^{p}$ and that the Jacobian $D\hat{R}_{n}(\cdot)$ is a continuous map on $B_{r}(\theta^{\ast})$, with the property that for each $\theta$ in $B_{r}(\theta^{\ast})$ there exists a vector $d$ in $B_{r}(0)$ such that

\lim_{t\downarrow 0}\ \frac{D\hat{R}_{n}(\theta+td)-D\hat{R}_{n}(\theta)}{t}=-D\hat{R}_{n}(\theta^{*}). \qquad (12)

Then there exists $u$ in $B_{r}(\theta^{\ast})$ such that $D\hat{R}_{n}(u)=0$.

A.1.1 Computing the derivatives

Since the loss is twice differentiable, the empirical risk $\hat{R}_{n}$ is itself twice differentiable. The gradient of the empirical risk is given by

\nabla\hat{R}_{n}(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}(Y_{i}-f(X_{i}^{\top}\theta))\ f^{\prime}(X_{i}^{\top}\theta)X_{i}. \qquad (13)

Hence, for $\theta=\theta^{*}$, and recalling that $Y_{i}-f(X_{i}^{\top}\theta^{*})=\varepsilon_{i}$, we get

\nabla\hat{R}_{n}(\theta^{*}) = -\frac{1}{n}X^{\top}D_{\nu}\,\ell^{\prime}(\varepsilon), \qquad (14)

where $\ell^{\prime}(\varepsilon)$ is to be understood componentwise, and $D_{\nu}$ is a diagonal matrix with coefficients $\nu_{i}=f^{\prime}(X_{i}^{\top}\theta^{\ast})$ for all $i$ in $1,\ldots,n$.

The Hessian is given by

\nabla^{2}\hat{R}_{n}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\Big(\ell^{\prime\prime}(Y_{i}-f(X_{i}^{\top}\theta))\ f^{\prime}(X_{i}^{\top}\theta)^{2}-\ell^{\prime}(Y_{i}-f(X_{i}^{\top}\theta))\ f^{\prime\prime}(X_{i}^{\top}\theta)\Big)X_{i}X_{i}^{\top} = \frac{1}{n}\,X^{\top}D_{\mu}X, \qquad (15)

where $D_{\mu}$ is a diagonal matrix given by, for all $i$ in $1,\ldots,n$,

\mu_{i} = \ell^{\prime\prime}(Y_{i}-f(X_{i}^{\top}\theta))\ f^{\prime}(X_{i}^{\top}\theta)^{2}-\ell^{\prime}(Y_{i}-f(X_{i}^{\top}\theta))\ f^{\prime\prime}(X_{i}^{\top}\theta).

Notice that when $\mu_{1},\ldots,\mu_{n}$ are all non-negative for all $\theta\in\mathbb{R}^{p}$, the empirical risk is convex. The condition we have to satisfy in order to use Neuberger's theorem, i.e. the version of (12) associated with our setting, is the following

\nabla^{2}\hat{R}_{n}(\theta)d = -\nabla\hat{R}_{n}(\theta^{\ast}).
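The following sketch, under our own illustrative choices of $\ell$, $f$ and data, assembles the gradient (13)-(14) and Hessian (15) and solves the equation $\nabla^{2}\hat{R}_{n}(\theta)d=-\nabla\hat{R}_{n}(\theta^{\ast})$ for $d$; it is only meant to make the objects $D_{\nu}$, $D_{\mu}$ and $d$ concrete, not to reproduce the theoretical bounds.

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 50, 10
    theta_star = rng.normal(size=p)
    X = rng.normal(size=(n, p))
    f = lambda z: z + 0.1 * np.tanh(z)                     # a strictly increasing ridge function (illustrative)
    f1 = lambda z: 1 + 0.1 / np.cosh(z) ** 2               # f'
    f2 = lambda z: -0.2 * np.tanh(z) / np.cosh(z) ** 2     # f''
    eps = 0.05 * rng.normal(size=n)
    Y = f(X @ theta_star) + eps

    def grad_hess(theta):
        z = X @ theta
        resid = Y - f(z)
        # squared loss: l'(u) = u and l''(u) = 1
        nu = f1(z)                                         # entries of D_nu
        mu = 1.0 * f1(z) ** 2 - resid * f2(z)              # entries of D_mu, as in (15)
        grad = -(X.T @ (resid * nu)) / n
        hess = (X.T * mu) @ X / n                          # (1/n) X^T D_mu X
        return grad, hess

    theta = theta_star + 0.1 * rng.normal(size=p)          # a point in a ball around theta*
    g_star, _ = grad_hess(theta_star)                      # gradient at theta*, i.e. (14)
    _, H = grad_hess(theta)                                # Hessian at theta
    d = np.linalg.lstsq(H, -g_star, rcond=None)[0]         # direction d solving the displayed equation
    print("||d||_2 =", np.linalg.norm(d))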

A.1.2 A change of variable

Since

\hat{R}_{n}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\ell(Y_{i}-f(X_{i}^{\top}\theta)) \qquad (16)

and $X_{i}^{\top}=Z_{i}^{\top}A^{\top}$, the risk can be written as a function of $A^{\top}\theta$ as follows:

\hat{R}^{(A)}_{n}(\theta^{(A)}) = \frac{1}{n}\sum_{i=1}^{n}\ell(Y_{i}-f(Z_{i}^{\top}\theta^{(A)})) \qquad (17)

with $\theta^{(A)}=A^{\top}\theta$. Clearly, minimising $\hat{R}_{n}$ in $\theta$ is equivalent to minimising $\hat{R}^{(A)}_{n}$ in $\theta^{(A)}$, up to equivalence modulo the kernel of $A^{\top}$.

\nabla\hat{R}^{(A)}_{n}(\theta^{(A)}) = -\frac{1}{n}\sum_{i=1}^{n}\ell^{\prime}(Y_{i}-f(Z_{i}^{\top}\theta^{(A)}))\ f^{\prime}(Z_{i}^{\top}\theta^{(A)})Z_{i} \qquad (18)
 = -\frac{1}{n}Z^{\top}D_{\nu}\,\ell^{\prime}(\varepsilon), \qquad (19)
\nabla^{2}\hat{R}^{(A)}_{n}(\theta^{(A)}) = \frac{1}{n}\sum_{i=1}^{n}\Big(\ell^{\prime\prime}(Y_{i}-f(Z_{i}^{\top}\theta^{(A)}))\ f^{\prime}(Z_{i}^{\top}\theta^{(A)})^{2}-\ell^{\prime}(Y_{i}-f(Z_{i}^{\top}\theta^{(A)}))\ f^{\prime\prime}(Z_{i}^{\top}\theta^{(A)})\Big)Z_{i}Z_{i}^{\top}
 = \frac{1}{n}\,Z^{\top}D_{\mu}Z, \qquad (20)

where $D_{\mu}$ is a diagonal matrix given by, for all $i$ in $1,\ldots,n$,

\mu_{i} = \ell^{\prime\prime}(Y_{i}-f(Z_{i}^{\top}\theta^{(A)}))\ f^{\prime}(Z_{i}^{\top}\theta^{(A)})^{2}-\ell^{\prime}(Y_{i}-f(Z_{i}^{\top}\theta^{(A)}))\ f^{\prime\prime}(Z_{i}^{\top}\theta^{(A)}).

The Neuberger equation in $\theta^{(A)}$ is now given by

\lim_{t\downarrow 0}\ \frac{D\hat{R}^{(A)}_{n}(\theta^{(A)}+td)-D\hat{R}^{(A)}_{n}(\theta^{(A)})}{t}=-D\hat{R}^{(A)}_{n}(\theta^{(A)^{*}}), \qquad (21)

where $\theta^{(A)^{*}}=A^{\top}\theta^{*}$.

A.1.3 A technical lemma

Lemma A.2.

For all $i=1,\ldots,n$, the variable $\ell^{\prime}(\varepsilon_{i})$ is subGaussian, with $\|\ell^{\prime}(\varepsilon_{i})\|_{\psi_{2}}\leq C_{\ell^{\prime\prime}}\,\|\varepsilon_{i}\|_{\psi_{2}}$.

Proof.

Let us compute

\|\ell^{\prime}(\varepsilon_{i})\|_{\psi_{2}} = \sup_{\gamma\geq 1}\gamma^{-1/2}\Big(\mathbb{E}|\ell^{\prime}(\varepsilon_{i})|^{\gamma}\Big)^{1/\gamma}.

The Lipschitz property of $\ell^{\prime}$ implies that

|\ell^{\prime}(\varepsilon_{i})-\ell^{\prime}(0)| \leq C_{\ell^{\prime\prime}}|\varepsilon_{i}-0|

and since $\ell^{\prime}(0)=0$, we get

|\ell^{\prime}(\varepsilon_{i})| \leq C_{\ell^{\prime\prime}}|\varepsilon_{i}|,

which implies that

\|\ell^{\prime}(\varepsilon_{i})\|_{\psi_{2}} \leq C_{\ell^{\prime\prime}}\,\sup_{\gamma\geq 1}\gamma^{-1/2}\Big(\mathbb{E}|\varepsilon_{i}|^{\gamma}\Big)^{1/\gamma}.

Thus,

\|\ell^{\prime}(\varepsilon_{i})\|_{\psi_{2}} \leq C_{\ell^{\prime\prime}}\,\|\varepsilon_{i}\|_{\psi_{2}}<+\infty

as announced. The proof is complete. ∎

A.2 Proof of Theorem 2.1: The underparametrised case

A.2.1 Key lemma

Using Corollary C.2, we have $s_{n}(Z)>0$ as long as

C_{K_{Z}}^{2}p<(1-\alpha)^{2}n

for some positive constant $C_{K_{Z}}$ depending on the subGaussian norm $K_{Z}$ of $Z_{1},\ldots,Z_{n}$.

Lemma A.3.

With probability larger than or equal to $1-\exp\left(-\frac{p}{2}\right)$, we have

\|d\|_{2} \leq \frac{C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}\sqrt{p}}{\delta\,s_{n}(Z)},

where $C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}=6\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}$.

Proof.

We compute the solution of Neuberger's equation (21), using the Jacobian vector formula in (19) and the Hessian matrix formula in (20):

\nabla^{2}\hat{R}^{(A)}_{n}(\theta^{(A)})d = -\nabla\hat{R}^{(A)}_{n}(A^{\top}\theta^{*}),

which gives

Z^{\top}D_{\mu}Zd = Z^{\top}D_{\nu}\ell^{\prime}(\varepsilon).

The singular value decomposition of $Z$ gives $Z=U\Sigma V^{\top}$, where $U\in\mathbb{R}^{n\times p}$ is a matrix whose columns form an orthonormal family, $V\in\mathbb{R}^{p\times p}$ is an orthogonal matrix and $\Sigma\in\mathbb{R}^{p\times p}$ is a diagonal and invertible matrix. Thus, we obtain the equivalent equation

V\Sigma U^{\top}D_{\mu}U\Sigma V^{\top}d = V\Sigma U^{\top}D_{\nu}\ell^{\prime}(\varepsilon).

Note that any solution of the system

U^{\top}D_{\mu}U\Sigma V^{\top}d = U^{\top}D_{\nu}\ell^{\prime}(\varepsilon)

will be admissible. Hence,

V^{\top}d = \Sigma^{-1}(U^{\top}D_{\mu}U)^{-1}U^{\top}D_{\nu}\ell^{\prime}(\varepsilon).

Then, as $\|d\|_{2}=\|V^{\top}d\|_{2}$,

\|d\|_{2} \leq \|\Sigma^{-1}(U^{\top}D_{\mu}U)^{-1}U^{\top}D_{\nu}\ell^{\prime}(\varepsilon)\|_{2} \leq \|\Sigma^{-1}\|_{2}\,\|(U^{\top}D_{\mu}U)^{-1}U^{\top}D_{\nu}\ell^{\prime}(\varepsilon)\|_{2} \leq \frac{1}{s_{n}(Z)}\|U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{2}. \qquad (22)

Using maximal inequalities, we can now bound the deviation probability of

\|U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{2} = \max_{\|u\|_{2}=1}\ u^{\top}U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon),

with $u\in\mathbb{R}^{p}$. For this purpose, we first prove that $U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)$ is a subGaussian vector and provide an explicit upper bound on its norm. Since $U$ is an $n\times p$ matrix whose columns form an orthonormal family, $u^{\top}U^{\top}$ is a unit-norm $\ell_{2}$ vector of size $n$, which is denoted by $w$. Then,

u^{\top}U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon) = \sum_{i=1}^{n}w_{i}\frac{\nu_{i}}{\mu_{i}}\,\ell^{\prime}(\varepsilon_{i}),

and since $\ell^{\prime}(\varepsilon_{i})$ is centered and subGaussian for all $i=1,\ldots,n$, Vershynin's [18, Proposition 2.6.1] gives

\|u^{\top}U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{\psi_{2}}^{2} \leq C\sum_{i=1}^{n}\left\|w_{i}\frac{\nu_{i}}{\mu_{i}}\,\ell^{\prime}(\varepsilon_{i})\right\|_{\psi_{2}}^{2},

for some absolute constant $C$. Since

\left\|w_{i}\frac{\nu_{i}}{\mu_{i}}\ell^{\prime}(\varepsilon_{i})\right\|_{\psi_{2}}^{2} \leq |w_{i}|^{2}\ \max_{i^{\prime}=1,\ldots,n}\left\|\frac{\nu_{i^{\prime}}}{\mu_{i^{\prime}}}\,\ell^{\prime}(\varepsilon_{i^{\prime}})\right\|_{\psi_{2}}^{2},

we have

\|u^{\top}U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{\psi_{2}}^{2} \leq C\|w\|_{2}^{2}\ \max_{i^{\prime}=1,\ldots,n}\left\|\frac{\nu_{i^{\prime}}}{\mu_{i^{\prime}}}\,\ell^{\prime}(\varepsilon_{i^{\prime}})\right\|_{\psi_{2}}^{2}.

As $\|w\|_{2}=1$, we get that for all $u\in\mathbb{R}^{p}$ with $\|u\|_{2}=1$,

\|u^{\top}U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{\psi_{2}}^{2} \leq C\max_{i^{\prime}=1,\ldots,n}\left\|\frac{\nu_{i^{\prime}}}{\mu_{i^{\prime}}}\,\ell^{\prime}(\varepsilon_{i^{\prime}})\right\|_{\psi_{2}}^{2}.

We deduce that $U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)$ is a subGaussian random vector with variance proxy

C\max_{i^{\prime}=1,\ldots,n}\left\|\frac{\nu_{i^{\prime}}}{\mu_{i^{\prime}}}\,\ell^{\prime}(\varepsilon_{i^{\prime}})\right\|_{\psi_{2}}^{2}.

Using the maximal inequality from [15, Theorem 1.19] and the subGaussian properties described in [18, Proposition 2.5.2], for any $\eta>0$, we get with probability $1-\exp\left(-\frac{\eta^{2}}{2}\right)$

\|U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{2} \leq 2\sqrt{C}\max_{i^{\prime}=1,\ldots,n}\left\|\frac{\nu_{i^{\prime}}}{\mu_{i^{\prime}}}\,\ell^{\prime}(\varepsilon_{i^{\prime}})\right\|_{\psi_{2}}(2\sqrt{p}+\eta). \qquad (23)

Following Lemma A.2, we deduce that $\frac{\nu_{i^{\prime}}}{\mu_{i^{\prime}}}\ell^{\prime}(\varepsilon_{i^{\prime}})$ is a subGaussian random variable with

\left\|\frac{\nu_{i^{\prime}}}{\mu_{i^{\prime}}}\ell^{\prime}(\varepsilon_{i^{\prime}})\right\|_{\psi_{2}} \leq \frac{|\nu_{i^{\prime}}|}{|\mu_{i^{\prime}}|}C_{\ell^{\prime\prime}}\|\varepsilon_{i^{\prime}}\|_{\psi_{2}} \leq C_{\ell^{\prime\prime}}\frac{\max_{i^{\prime}}|\nu_{i^{\prime}}|}{\min_{i^{\prime}}|\mu_{i^{\prime}}|}K_{\varepsilon}.

Then, restricting a priori $\theta^{(A)}$ to lie in the ball $\mathcal{B}_{2}(\theta^{(A)^{*}},r)$, which is coherent with the assumptions of Neuberger's Theorem A.1, and using Assumption 4 together with the boundedness of $f^{\prime}$, we get

\frac{\max_{i^{\prime}=1,\ldots,n}|\nu_{i^{\prime}}|}{\min_{i^{\prime}=1,\ldots,n}|\mu_{i^{\prime}}|}\leq\frac{C_{f^{\prime}}}{\delta}.

Subsequently, the bound (23) becomes

\|U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{2} \leq 2\sqrt{C}C_{\ell^{\prime\prime}}\frac{\max_{i^{\prime}=1,\ldots,n}|\nu_{i^{\prime}}|}{\min_{i^{\prime}=1,\ldots,n}|\mu_{i^{\prime}}|}K_{\varepsilon}(2\sqrt{p}+\eta) \leq \frac{2\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}(2\sqrt{p}+\eta)}{\delta},

with probability $1-\exp\left(-\frac{\eta^{2}}{2}\right)$. Taking $\eta=\sqrt{p}$, equation (22) yields

\|d\|_{2}=\Big\|\nabla^{2}\hat{R}_{n}(\theta)^{-1}\,\nabla\hat{R}_{n}(\theta^{*})\Big\|_{2}\leq\frac{6\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{p}}{\delta\,s_{n}(Z)},

with probability $1-\exp\left(-\frac{p}{2}\right)$. ∎

A.2.2 End of the proof of Theorem 2.1

In order to complete the proof of Theorem 2.1, note that, using Corollary C.2, we have with probability $1-2\exp(-c_{K_{Z}}\alpha^{2}n)$

s_{n}(Z)\geq(1-\alpha)\sqrt{n}-C_{K_{Z}}\sqrt{p},

where $C_{K_{Z}},c_{K_{Z}}>0$ depend only on the subGaussian norm $K_{Z}$. Therefore, with probability larger than or equal to

1-2\exp(-c_{K_{Z}}\alpha^{2}n)-\exp\left(-\frac{p}{2}\right),

we have

\|d\|_{2} \leq \frac{6\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{p}}{\delta\left((1-\alpha)\sqrt{n}-C_{K_{Z}}\sqrt{p}\right)}.

This completes the proof of Theorem 2.1.

A.3 Proof of Theorem 2.2: The overparametrised case

A.3.1 Key lemma

Using Theorem C.3, we have $s_{n}(Z^{\top})>0$ as long as

C_{K_{Z}}^{2}\,n<(1-\alpha)^{2}p

for some positive constant $C_{K_{Z}}$ depending on the subGaussian norm $K_{Z}$ of $Z_{1},\ldots,Z_{n}$.

Lemma A.4.

With probability larger than or equal to $1-\exp\left(-\frac{n}{2}\right)$, we have

\|d\|_{2} \leq \frac{C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}\sqrt{n}}{\delta\,s_{n}(Z^{\top})},

where $C_{(\ell^{\prime\prime},f^{\prime},\varepsilon)}=6\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}$.

Proof.

As in the underparametrised case, we have to solve (21), i.e.

Z^{\top}D_{\mu}Zd=Z^{\top}D_{\nu}\ell^{\prime}(\varepsilon),

which can be solved by finding the least norm solution of the interpolation problem

D_{\mu}Zd = D_{\nu}\ell^{\prime}(\varepsilon), \qquad (24)

i.e.

d = Z^{\dagger}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon),

where $Z^{\dagger}$ is the Moore-Penrose pseudo-inverse of $Z$.

Given the compact SVD $Z=U\Sigma V^{\top}$, where $U\in O(n)$ and $V\in\mathbb{R}^{p\times n}$ has orthonormal columns, we get

D_{\mu}U\Sigma V^{\top}d = D_{\nu}\ell^{\prime}(\varepsilon).

We then have

\|V^{\top}d\|_{2} = \|\Sigma^{-1}U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{2}.

As $\|x\|_{2}=\|Vx\|_{2}$ for all $x\in\mathbb{R}^{n}$,

\|d\|_{2} \leq \frac{1}{s_{n}(Z^{\top})}\|U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{2}.

Then, using the bound

\|U^{\top}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)\|_{2}\leq\frac{2\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}(2\sqrt{n}+\eta)}{\delta},

which holds with probability $1-\exp\left(-\frac{\eta^{2}}{2}\right)$ and can be obtained in the same way as (23) in the underparametrised case, and taking $\eta=\sqrt{n}$, we get

\|d\|_{2} \leq \frac{6\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{n}}{\delta\,s_{n}(Z^{\top})}, \qquad (25)

with probability $1-\exp\left(-\frac{n}{2}\right)$. ∎
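To make the least-norm computation above concrete, the sketch below (with the squared loss, so that $\ell^{\prime}(\varepsilon)=\varepsilon$, and illustrative choices of $f$ and data that are ours) forms $d=Z^{\dagger}D_{\mu}^{-1}D_{\nu}\ell^{\prime}(\varepsilon)$ and checks that it solves $Z^{\top}D_{\mu}Zd=Z^{\top}D_{\nu}\ell^{\prime}(\varepsilon)$.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 30, 120                                    # overparametrised: n < p
    theta_star = rng.normal(size=p) / np.sqrt(p)
    Z = rng.normal(size=(n, p))
    f = lambda z: z + 0.1 * np.tanh(z)                # strictly increasing ridge function (illustrative)
    f1 = lambda z: 1 + 0.1 / np.cosh(z) ** 2          # f'
    f2 = lambda z: -0.2 * np.tanh(z) / np.cosh(z) ** 2  # f''
    eps = 0.05 * rng.normal(size=n)
    Y = f(Z @ theta_star) + eps

    z0 = Z @ theta_star                               # weights evaluated at theta^(A)* (illustrative choice)
    nu = f1(z0)                                       # diagonal of D_nu
    mu = f1(z0) ** 2 - eps * f2(z0)                   # diagonal of D_mu (squared loss: l'' = 1, l'(eps) = eps)

    d = np.linalg.pinv(Z) @ ((nu / mu) * eps)         # d = Z^dagger D_mu^{-1} D_nu l'(eps)
    lhs = Z.T @ (mu * (Z @ d))                        # Z^T D_mu Z d
    rhs = Z.T @ (nu * eps)                            # Z^T D_nu l'(eps)
    print("max residual of the Neuberger equation:", np.abs(lhs - rhs).max())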

A.3.2 End of the proof of Theorem 2.2

In order to complete the proof of Theorem 2.2, note that, using Theorem C.3, we have with probability $1-2\exp(-c_{K_{Z}}\alpha^{2}p)$

s_{n}(Z^{\top})\geq(1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}. \qquad (26)

Therefore, with probability larger than or equal to

1-2\exp(-c_{K_{Z}}\alpha^{2}p)-\exp\left(-\frac{n}{2}\right),

we have

\|d\|_{2} \leq \frac{6\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}.

This completes the proof of Theorem 2.2.

Appendix B Proof of Theorem 3.1

B.1 First part of the proof

Since $\hat{\theta}$ and $\hat{\theta}^{\circ}$ are two interpolating minimisers of the empirical risk, i.e. they satisfy $Y_{i}-f(X_{i}^{\top}\hat{\theta})=0$ and $Y_{i}-f(X_{i}^{\top}\hat{\theta}^{\circ})=0$ for all $i$ in $1,\ldots,n$, using that $f$ is strictly monotonic, we get

X\hat{\theta}^{\circ} = X\hat{\theta}. \qquad (27)

B.2 Second part of the proof

Recall that $\hat{\theta}^{\circ}$ denotes the minimum norm solution to the ERM, i.e.

\text{argmin}_{\theta}\ \|\theta\|_{2}\quad\text{subject to } X\theta=X\hat{\theta}.

Let $\hat{\theta}^{\sharp}$ solve

\text{argmin}_{\theta}\ \|\theta\|_{2}\quad\text{subject to }\begin{bmatrix}X\\ X_{n+1}^{\top}\end{bmatrix}\theta=\begin{bmatrix}X\\ X_{n+1}^{\top}\end{bmatrix}\hat{\theta},

where $\hat{\theta}$ is the solution exhibited in Theorem 2.2. Then,

|X_{n+1}^{\top}(\hat{\theta}^{\circ}-\theta^{*})| \leq |X_{n+1}^{\top}(\hat{\theta}^{\circ}-\hat{\theta}^{\sharp})|+|\underbrace{X_{n+1}^{\top}(\hat{\theta}^{\sharp}-\hat{\theta})}_{=0\text{ by definition}}|+|X_{n+1}^{\top}(\hat{\theta}-\theta^{*})| \leq |X_{n+1}^{\top}(\hat{\theta}^{\circ}-\hat{\theta}^{\sharp})|+|X_{n+1}^{\top}(\hat{\theta}-\theta^{*})| \qquad (28)

and since $X_{n+1}$ and $\hat{\theta}-\theta^{*}$ are independent, we have

\mathbb{P}\big(|X_{n+1}^{\top}(\hat{\theta}-\theta^{*})|\geq t\|A^{\top}(\hat{\theta}-\theta^{*})\|_{2}\big) = \mathbb{P}\big(|Z_{n+1}^{\top}A^{\top}(\hat{\theta}-\theta^{*})|\geq t\|A^{\top}(\hat{\theta}-\theta^{*})\|_{2}\big) \leq \exp(-c_{K_{Z}}t^{2}).

Now, using the bound on $\|A^{\top}(\hat{\theta}-\theta^{*})\|_{2}$ from Theorem 2.2, we get

|X_{n+1}^{\top}(\hat{\theta}-\theta^{*})| \leq t\,\frac{6\sqrt{C}C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}, \qquad (29)

with probability at least

1-2\exp(-c_{K_{Z}}\alpha^{2}p)-\exp\left(-\frac{n}{2}\right)-\exp(-c_{K_{Z}}t^{2}).

B.3 Third part of the proof: Block pseudo-inverse computation

Since $\hat{\theta}^{\sharp}$ is the minimum $\ell_{2}$-norm solution to the under-determined system

\begin{bmatrix}X\\ X_{n+1}^{\top}\end{bmatrix}\theta=\begin{bmatrix}y_{1:n}\\ y_{n+1}\end{bmatrix},

we have

\hat{\theta}^{\sharp} = \left(X\left(I-\frac{1}{p}X_{n+1}X_{n+1}^{\top}\right)\right)^{\dagger}y_{1:n}+\left(X_{n+1}^{\top}\left(I-X^{\top}\left(XX^{\top}\right)^{\dagger}X\right)\right)^{\dagger}y_{n+1}.

This gives

\hat{\theta}^{\sharp} = \left(I-\frac{1}{p}X_{n+1}X_{n+1}^{\top}\right)X^{\dagger}y_{1:n}+\left(I-X^{\top}\left(XX^{\top}\right)^{\dagger}X\right)X_{n+1}^{\top^{\dagger}}y_{n+1}, \qquad (30)

since

\left(I-\frac{1}{p}X_{n+1}X_{n+1}^{\top}\right)^{\dagger}=\left(I-\frac{1}{p}X_{n+1}X_{n+1}^{\top}\right)

and

\left(X_{n+1}^{\top}\left(I-X^{\top}\left(XX^{\top}\right)^{-1}X\right)\right)^{\dagger}=\left(X_{n+1}^{\top}\left(I-X^{\top}\left(XX^{\top}\right)^{\dagger}X\right)\right).

B.4 Fourth part of the proof

Since $\hat{\theta}^{\circ}=X^{\dagger}y_{1:n}$, we get from the expression of $\hat{\theta}^{\sharp}$ in (30) that

|X_{n+1}^{\top}(\hat{\theta}^{\sharp}-\hat{\theta}^{\circ})| = \left|-\frac{1}{p}X_{n+1}^{\top}X_{n+1}X_{n+1}^{\top}X^{\dagger}y_{1:n}+X_{n+1}^{\top}\left(I-X^{\top}\left(XX^{\top}\right)^{\dagger}X\right)X_{n+1}^{\dagger}y_{n+1}\right|
 \leq \left|X_{n+1}^{\top}X^{\top}(XX^{\top})^{\dagger}X\theta^{*}+X_{n+1}^{\top}\left(I-X^{\top}\left(XX^{\top}\right)^{\dagger}X\right)X_{n+1}^{\dagger}X_{n+1}^{\top}\theta^{*}\right|+\left|X_{n+1}^{\top}X^{\top}(XX^{\top})^{\dagger}\varepsilon\right|+\left|\varepsilon_{n+1}\right|
 \leq \left|X_{n+1}^{\top}X^{\top}(XX^{\top})^{\dagger}X\theta^{*}+X_{n+1}^{\top}\left(I-X^{\top}\left(XX^{\top}\right)^{\dagger}X\right)\theta^{*}\right|+\left|X_{n+1}^{\top}X^{\top}(XX^{\top})^{\dagger}\varepsilon\right|+\left|\varepsilon_{n+1}\right|,

which gives

|X_{n+1}^{\top}(\hat{\theta}^{\sharp}-\hat{\theta}^{\circ})| \leq \left|X_{n+1}^{\top}\theta^{*}\right|+\left|X_{n+1}^{\top}X^{\top}(XX^{\top})^{\dagger}\varepsilon\right|+\left|\varepsilon_{n+1}\right|. \qquad (31)

B.5 Fifth part of the proof

In this section, we control each term on the right-hand side of (31).

Control of $\left|X_{n+1}^{\top}\theta^{*}\right|$

We have

\mathbb{P}\big(|\langle\theta^{*},X_{n+1}\rangle|\geq t\,K_{X}\|\theta^{*}\|_{2}\big)=\mathbb{P}\big(|\langle A^{\top}\theta^{*},Z_{n+1}\rangle|\geq t\,K_{Z}\|A^{\top}\theta^{*}\|_{2}\big) \leq \exp\left(-\frac{t^{2}}{2}\right). \qquad (32)

Control of $\left|X_{n+1}^{\top}X^{\top}(XX^{\top})^{\dagger}\varepsilon\right|$

Let us compute an upper bound on the norm of $X^{\top}(XX^{\top})^{\dagger}\varepsilon$ which holds with large probability. We have

\|X^{\top}(XX^{\top})^{\dagger}\varepsilon\|_{2}^{2} = \varepsilon^{\top}(XX^{\top})^{\dagger}\varepsilon=\varepsilon^{\top}(ZA^{\top}AZ^{\top})^{\dagger}\varepsilon.

Our main tool will be the following result from Vershynin's book [18, Exercise 6.3.5]:

\mathbb{P}\left(\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\geq C^{\prime}K_{\varepsilon}\|(XX^{\top})^{\dagger^{1/2}}\|_{F}+t\mid X\right)\leq\exp\left(-\frac{ct^{2}}{K_{\varepsilon}^{2}\|(XX^{\top})^{\dagger^{1/2}}\|^{2}}\right), \qquad (33)

where $C^{\prime}$ and $c$ are absolute positive constants.

Let us introduce the singular value decomposition of

A=U_{(A)}\Sigma_{(A)}V_{(A)}^{\top},

where $\Sigma_{(A)}$ is an $r_{(A)}\times r_{(A)}$ matrix and $r_{(A)}$ is the rank of $A$. We then have

XX^{\top} = ZA^{\top}AZ^{\top}=Z(U_{(A)}\Sigma_{(A)}V_{(A)}^{\top})^{\top}(U_{(A)}\Sigma_{(A)}V_{(A)}^{\top})Z^{\top}=ZV_{(A)}\Sigma_{(A)}U_{(A)}^{\top}U_{(A)}\Sigma_{(A)}V_{(A)}^{\top}Z^{\top} = ZV_{(A)}\Sigma_{(A)}^{2}V_{(A)}^{\top}Z^{\top},

because $U_{(A)}$ has orthonormal columns.

Then we have

\|(XX^{\top})^{\dagger^{1/2}}\|_{F}^{2} = \left\|\left(ZV_{(A)}\Sigma_{(A)}^{2}V_{(A)}^{\top}Z^{\top}\right)^{\dagger^{1/2}}\right\|_{F}^{2}=\sum_{i=1}^{n}\lambda_{i}\left(\left(ZV_{(A)}\Sigma_{(A)}^{2}V_{(A)}^{\top}Z^{\top}\right)^{\dagger}\right),

and since $\lambda_{i}(BB^{\top})=s_{i}(B)^{2}$ for any matrix $B$:

\|(XX^{\top})^{\dagger^{\frac{1}{2}}}\|_{F}^{2} = \sum_{i=1}^{n}s_{i}\left(ZV_{(A)}\Sigma_{(A)}\right)^{-2} \leq n\,s_{n}(\Sigma_{(A)}V_{(A)}^{\top}Z^{\top})^{-2} \leq n\,s_{n}(\Sigma_{(A)})^{-2}\,s_{n}(V_{(A)}^{\top}Z^{\top})^{-2},

where $s_{n}$ denotes the $n$-th largest singular value. Since the columns of $V_{(A)}$ form an orthonormal family, we get

\|(XX^{\top})^{\dagger^{\frac{1}{2}}}\|_{F}^{2} \leq n\,s_{n}(Z^{\top})^{-2}s_{n}(\Sigma_{(A)})^{-2}\leq n\,s_{n}(Z^{\top})^{-2}s_{n}(A)^{-2}.

Hence

\|(XX^{\top})^{\dagger^{1/2}}\|_{F} \leq \sqrt{n}\,s_{n}(Z^{\top})^{-1}s_{n}(A)^{-1}.

Since $(XX^{\top})^{\dagger^{1/2}}$ is symmetric, we have

\|(XX^{\top})^{\dagger^{1/2}}\| = \lambda_{1}\big((XX^{\top})^{\dagger^{1/2}}\big)=\lambda_{1}\big(((ZV_{(A)}\Sigma_{(A)}^{2}V_{(A)}^{\top}Z^{\top})^{\dagger})^{1/2}\big) = s_{1}\big((ZV_{(A)}\Sigma_{(A)})^{\dagger}\big) = \frac{1}{s_{n}(\Sigma_{(A)}V_{(A)}^{\top}Z^{\top})} \leq \frac{1}{s_{n}(\Sigma_{(A)})s_{n}(Z^{\top})} = \frac{1}{s_{n}(Z^{\top})s_{n}(A)}.

Define the event

\mathcal{E}_{Z} = \left\{s_{n}\left(Z^{\top}\right)\geq(1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right\}.

Then, following Theorem C.3,

\mathbb{P}(\mathcal{E}_{Z}^{c}) \leq 2\exp\left(-c_{K_{Z}}\alpha p\right).

Let us notice that, on the event $\mathcal{E}_{Z}$, we have

\|(XX^{\top})^{\dagger^{1/2}}\|_{F} \leq \frac{\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}, \qquad \|(XX^{\top})^{\dagger^{1/2}}\| \leq \frac{1}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}.

Using (33) after setting

t = \sqrt{n}\,\frac{C^{\prime}K_{\varepsilon}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)},

and conditioning on $Z$, we get

\mathbb{P}_{\varepsilon}\left(\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\geq\frac{2C^{\prime}K_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\,\Big|\,Z\right)\leq\exp\left(-cC^{\prime 2}n\right), \qquad (34)

as long as $Z\in\mathcal{E}_{Z}$.

Writing for brevity

B_{n,p} := \frac{2C^{\prime}K_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)},

let us define

\mathcal{E}_{\varepsilon,Z} = \left\{(\varepsilon,Z)\mid\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\leq B_{n,p}\text{ and }Z\in\mathcal{E}_{Z}\right\},

so that

\mathcal{E}_{\varepsilon,Z}^{c} = \left\{(\varepsilon,Z)\mid\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}>B_{n,p}\text{ or }Z\notin\mathcal{E}_{Z}\right\}.

Notice that

ε,Z(ε,Z)\displaystyle\mathbb{P}_{\varepsilon,Z}\left(\mathcal{E}_{\varepsilon,Z}\right) =𝔼ε,Z[1(XX)1/2ε22CKεnsn(A)((1α)pCKZn)1ZZ]\displaystyle=\mathbb{E}_{\varepsilon,Z}\left[1_{\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\leq\frac{2^{\prime}CK_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}}1_{Z\in\mathcal{E}_{Z}}\right]
=𝔼Z[𝔼ε[1(XX)1/2ε22CKεnsn(A)((1α)pCKZn)1ZZ|Z]]\displaystyle=\mathbb{E}_{Z}\left[\mathbb{E}_{\varepsilon}\left[1_{\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\leq\frac{2C^{\prime}K_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}}1_{Z\in\mathcal{E}_{Z}}\Big{|}Z\right]\right]
=𝔼Z[𝔼ε[1(XX)1/2ε22CKεnsn(A)((1α)pCKZn)|Z]1ZZ]\displaystyle=\mathbb{E}_{Z}\left[\mathbb{E}_{\varepsilon}\left[1_{\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\leq\frac{2C^{\prime}K_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}}\Big{|}Z\right]1_{Z\in\mathcal{E}_{Z}}\right]
=𝔼Z[ε((XX)1/2ε22CKεnsn(A)((1α)pCKZn)|Z)1ZZ]\displaystyle=\mathbb{E}_{Z}\left[\mathbb{P}_{\varepsilon}\left(\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\leq\frac{2C^{\prime}K_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\Big{|}Z\right)1_{Z\in\mathcal{E}_{Z}}\right]

which gives

\begin{align*}
\mathbb{P}_{\varepsilon,Z}\left(\mathcal{E}_{\varepsilon,Z}^{c}\right)
&=1-\mathbb{E}_{Z}\left[\mathbb{P}_{\varepsilon}\left(\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\leq\frac{2C^{\prime}K_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\,\Big|\,Z\right)1_{Z\in\mathcal{E}_{Z}}\right]\\
&=\mathbb{E}_{Z}\left[1-\mathbb{P}_{\varepsilon}\left(\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\leq\frac{2C^{\prime}K_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\,\Big|\,Z\right)1_{Z\in\mathcal{E}_{Z}}\right]\\
&\leq\mathbb{E}_{Z}\left[1-\left(1-\mathbb{P}_{\varepsilon}\left(\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\geq\frac{2C^{\prime}K_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\,\Big|\,Z\right)\right)1_{Z\in\mathcal{E}_{Z}}\right]\\
&=\mathbb{E}_{Z}\left[\mathbb{P}_{\varepsilon}\left(\|(XX^{\top})^{\dagger^{1/2}}\varepsilon\|_{2}\geq\frac{2C^{\prime}K_{\varepsilon}\sqrt{n}}{s_{n}(A)\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\,\Big|\,Z\right)1_{Z\in\mathcal{E}_{Z}}\right]+1-\mathbb{E}_{Z}\left[1_{Z\in\mathcal{E}_{Z}}\right]\\
&\leq\mathbb{E}_{Z}\left[\exp\left(-cC^{\prime 2}n\right)1_{Z\in\mathcal{E}_{Z}}\right]+\mathbb{P}\left(\mathcal{E}_{Z}^{c}\right)\\
&\leq\exp\left(-cC^{\prime 2}n\right)\underbrace{\mathbb{P}_{Z}\left(Z\in\mathcal{E}_{Z}\right)}_{\leq 1}+\mathbb{P}\left(\mathcal{E}_{Z}^{c}\right),
\end{align*}

which finally yields

\begin{align*}
\mathbb{P}_{\varepsilon,Z}\left(\mathcal{E}_{\varepsilon,Z}^{c}\right)\leq\exp\left(-cC^{\prime 2}n\right)+2\exp\left(-c_{K_{Z}}\alpha p\right).
\end{align*}
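To get a feel for the orders of magnitude in this failure probability, one can evaluate the two exponential terms for a few sample sizes. The constants below are purely illustrative, since $c$, $C^{\prime}$ and $c_{K_{Z}}$ are only guaranteed to exist by the proof and are not pinned down numerically.

```python
import numpy as np

# Hypothetical values for the absolute constants, chosen only for illustration.
c, C_prime, c_KZ = 0.1, 2.0, 0.1
alpha = 0.25

for n, p in [(100, 1000), (200, 5000), (500, 20000)]:
    fail = np.exp(-c * C_prime**2 * n) + 2 * np.exp(-c_KZ * alpha * p)
    print(f"n={n:4d}, p={p:6d}:  bound on P(E_eps,Z^c) = {fail:.2e}")
```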

Let us turn to the study of $|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|$. Conditioning on $(\varepsilon,Z)$, we have

\begin{align*}
\mathbb{P}_{\varepsilon,Z,X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq u\right)
&=\mathbb{E}_{\varepsilon,Z}\left[\mathbb{P}_{X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq u\mid\varepsilon,Z\right)\right]\\
&=\mathbb{E}_{\varepsilon,Z}\left[\mathbb{P}_{X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq u\mid\varepsilon,Z\right)1_{\mathcal{E}_{Z}\cap\mathcal{E}_{\varepsilon,Z}}\right]\\
&\quad+\mathbb{E}_{\varepsilon,Z}\left[\underbrace{\mathbb{P}_{X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq u\mid\varepsilon,Z\right)}_{\leq 1}1_{\mathcal{E}_{Z}^{c}\cup\mathcal{E}_{\varepsilon,Z}^{c}}\right]\\
&\leq\mathbb{E}_{\varepsilon,Z}\left[\mathbb{P}_{X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq u\mid\varepsilon,Z\right)1_{\mathcal{E}_{Z}\cap\mathcal{E}_{\varepsilon,Z}}\right]\\
&\quad+\mathbb{P}_{\varepsilon,Z}\left(\mathcal{E}_{Z}^{c}\cup\mathcal{E}_{\varepsilon,Z}^{c}\right).
\end{align*}

On the other hand, by the subGaussianity of $Z_{n+1}$,

\begin{align*}
\mathbb{P}_{X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq u\mid\varepsilon,Z\right)
&=\mathbb{P}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,AZ_{n+1}\rangle|\geq u\mid\varepsilon,Z\right)\\
&=\mathbb{P}\left(|\langle A^{\top}X^{\top}(XX^{\top})^{\dagger}\varepsilon,Z_{n+1}\rangle|\geq u\mid\varepsilon,Z\right)\\
&\leq 2\exp\left(-\frac{u^{2}}{2\,K_{Z}^{2}\,\|A^{\top}X^{\top}(XX^{\top})^{\dagger}\varepsilon\|_{2}^{2}}\right),
\end{align*}

which gives

\begin{align*}
\mathbb{P}_{X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq u\mid\varepsilon,Z\right)\leq 2\exp\left(-\frac{u^{2}}{2\,K_{Z}^{2}\,s_{1}(A)^{2}\,\|X^{\top}(XX^{\top})^{\dagger}\varepsilon\|_{2}^{2}}\right).
\end{align*}

Then

\begin{align*}
\mathbb{P}_{\varepsilon,Z,X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq u\right)
&\leq\mathbb{E}_{\varepsilon,Z}\left[2\exp\left(-\frac{u^{2}}{2\,K_{Z}^{2}\,s_{1}(A)^{2}\,\|X^{\top}(XX^{\top})^{\dagger}\varepsilon\|_{2}^{2}}\right)1_{\mathcal{E}_{Z}\cap\mathcal{E}_{\varepsilon,Z}}\right]\\
&\quad+\mathbb{P}_{\varepsilon,Z}\left(\mathcal{E}_{Z}^{c}\cup\mathcal{E}_{\varepsilon,Z}^{c}\right)\\
&\leq 2\exp\left(-\frac{u^{2}}{2\,K_{Z}^{2}\,s_{1}(A)^{2}}\,\frac{s_{n}(A)^{2}\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)^{2}}{4C^{\prime 2}K_{\varepsilon}^{2}n}\right)\mathbb{P}_{\varepsilon,Z}\left[\mathcal{E}_{Z}\cap\mathcal{E}_{\varepsilon,Z}\right]\\
&\quad+\mathbb{P}_{\varepsilon,Z}\left(\mathcal{E}_{Z}^{c}\cup\mathcal{E}_{\varepsilon,Z}^{c}\right)\\
&\leq 2\exp\left(-\frac{u^{2}\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)^{2}}{8\,K_{Z}^{2}\,\kappa_{n}(A)^{2}\,C^{\prime 2}K_{\varepsilon}^{2}\,n}\right)+\mathbb{P}_{\varepsilon,Z}\left(\mathcal{E}_{Z}^{c}\cup\mathcal{E}_{\varepsilon,Z}^{c}\right),
\end{align*}

where $\kappa_{n}(A)=\frac{s_{1}(A)}{s_{n}(A)}$ denotes the $n$-th condition number of $A$.
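The key elementary ingredient in the last chain of inequalities is the subGaussian tail bound for a fixed vector $v$ and an independent subGaussian vector $Z_{n+1}$, namely $\mathbb{P}(|\langle v,Z_{n+1}\rangle|\geq u)\leq 2\exp\!\big(-u^{2}/(2K_{Z}^{2}\|v\|_{2}^{2})\big)$. The following small Monte Carlo sketch illustrates it in the Gaussian case, for which the bound holds with $K_{Z}=1$; the direction $v$ below is an arbitrary choice made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n_mc = 100, 50_000

v = rng.standard_normal(p)           # an arbitrary fixed direction
v /= np.linalg.norm(v)               # unit norm, so the bound reads 2*exp(-u^2/2)

Z = rng.standard_normal((n_mc, p))   # Gaussian coordinates: K_Z = 1 in the bound
ip = Z @ v                           # inner products <v, Z_i>

for u in [1.0, 2.0, 3.0]:
    emp = np.mean(np.abs(ip) >= u)
    bound = 2 * np.exp(-u**2 / 2)
    print(f"u={u:.1f}:  empirical {emp:.4f}  <=  bound {bound:.4f}")
```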

Since

\begin{align*}
\mathbb{P}_{\varepsilon,Z}\left(\mathcal{E}_{Z}^{c}\cup\mathcal{E}_{\varepsilon,Z}^{c}\right)
&\leq\mathbb{P}_{\varepsilon,Z}\left(\mathcal{E}_{Z}^{c}\right)+\mathbb{P}\left(\mathcal{E}_{\varepsilon,Z}^{c}\right)\leq 2\exp\left(-c_{K_{Z}}\alpha p\right)+\exp\left(-cC^{\prime 2}n\right)+\mathbb{P}\left(\mathcal{E}_{Z}^{c}\right)\\
&\leq 4\exp\left(-c_{K_{Z}}\alpha p\right)+\exp\left(-cC^{\prime 2}n\right),
\end{align*}

we get

\begin{align*}
\mathbb{P}_{\varepsilon,Z,X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq u\right)\leq 2\exp\left(-\frac{u^{2}\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)^{2}}{8\,K_{Z}^{2}\,C^{\prime 2}K_{\varepsilon}^{2}\,\kappa_{n}(A)^{2}\,n}\right)+4\exp\left(-c_{K_{Z}}\alpha p\right)+\exp\left(-cC^{\prime 2}n\right).
\end{align*}

Choosing $u=\sqrt{(n/p)\log(n^{\tau})}$ for some $\tau>0$, we get

\begin{align}
\mathbb{P}_{\varepsilon,Z,X_{n+1}}\left(|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|\geq\sqrt{(n/p)\log(n^{\tau})}\right)
&\leq 2\exp\left(-\frac{\tau\log(n)\left(1-\alpha-C_{K_{Z}}\sqrt{n/p}\right)^{2}}{8\,K_{Z}^{2}\,C^{\prime 2}K_{\varepsilon}^{2}\,\kappa_{n}(A)^{2}}\right)\tag{35}\\
&\quad+4\exp\left(-c_{K_{Z}}\alpha p\right)+\exp\left(-cC^{\prime 2}n\right).\tag{36}
\end{align}
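To illustrate that the threshold $\sqrt{(n/p)\log(n^{\tau})}$ is indeed rarely exceeded by the cross term, one can simulate it directly. The sketch below is a hypothetical special case of the model with Gaussian data and $A=I_{p}$; it is only meant to show that $|\langle X^{\top}(XX^{\top})^{\dagger}\varepsilon,X_{n+1}\rangle|$ is typically of order $\sqrt{n/p}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, tau, n_mc = 50, 2000, 2.0, 500
u = np.sqrt((n / p) * np.log(n**tau))     # threshold used in (35)-(36)

exceed = 0
for _ in range(n_mc):
    X = rng.standard_normal((n, p))             # hypothetical special case A = I_p
    eps = rng.standard_normal(n)                # Gaussian (hence subGaussian) noise
    v = X.T @ np.linalg.solve(X @ X.T, eps)     # X^T (X X^T)^dagger eps
    x_new = rng.standard_normal(p)              # fresh test point X_{n+1}
    exceed += abs(v @ x_new) >= u

print(f"u = {u:.3f},  empirical exceedance frequency = {exceed / n_mc:.3f}")
```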

B.6 Final step of the proof

Combining (29), (31), (32) and (36) with (28), we get

\begin{align*}
|X_{n+1}^{\top}(\hat{\theta}^{\circ}-\theta^{*})|
&\leq|X_{n+1}^{\top}(\hat{\theta}^{\circ}-\hat{\theta}^{\sharp})|+|X_{n+1}^{\top}(\hat{\theta}-\theta^{*})|\\
&\leq|X_{n+1}^{\top}(\hat{\theta}^{\circ}-\hat{\theta}^{\sharp})|+t\,\frac{6\sqrt{C}\,C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\\
&\leq\left|X_{n+1}^{\top}\theta^{*}\right|+\left|X_{n+1}^{\top}X^{\top}(XX^{\top})^{\dagger}\varepsilon\right|+\left|\varepsilon_{n+1}\right|+t\,\frac{6\sqrt{C}\,C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\\
&\leq t\,K_{Z}\|A^{\top}\theta^{*}\|_{2}+\left|X_{n+1}^{\top}X^{\top}(XX^{\top})^{\dagger}\varepsilon\right|+\left|\varepsilon_{n+1}\right|+t\,\frac{6\sqrt{C}\,C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\\
&\leq\left|\varepsilon_{n+1}\right|+t\,K_{Z}\|A^{\top}\theta^{*}\|_{2}+\sqrt{(n/p)\log(n^{\tau})}+t\,\frac{6\sqrt{C}\,C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}
\end{align*}

with probability at least

\begin{multline}
1-2\exp\left(-c_{K_{Z}}\alpha^{2}p\right)-\exp\left(-\frac{n}{2}\right)-\exp\left(-c_{K_{Z}}t^{2}\right)-\exp\left(-\frac{t^{2}}{2}\right)\\
-2\exp\left(-\frac{\tau\log(n)\left(1-\alpha-C_{K_{Z}}\sqrt{n/p}\right)^{2}}{8\,K_{Z}^{2}\,C^{\prime 2}K_{\varepsilon}^{2}\,\kappa_{n}(A)^{2}}\right)-4\exp\left(-c_{K_{Z}}\alpha p\right)-\exp\left(-cC^{\prime 2}n\right).\tag{37}
\end{multline}

Using the $C_{f^{\prime}}$-Lipschitz continuity of $f$, we obtain

\begin{align*}
|f(X_{n+1}^{\top}\hat{\theta}^{\circ})-f(X_{n+1}^{\top}\theta^{*})|\leq C_{f^{\prime}}\left(\left|\varepsilon_{n+1}\right|+t\,K_{Z}\|A^{\top}\theta^{*}\|_{2}+\sqrt{(n/p)\log(n^{\tau})}+t\,\frac{6\sqrt{C}\,C_{\ell^{\prime\prime}}C_{f^{\prime}}K_{\varepsilon}\sqrt{n}}{\delta\left((1-\alpha)\sqrt{p}-C_{K_{Z}}\sqrt{n}\right)}\right)
\end{align*}

with probability at least

\begin{multline}
1-2\exp\left(-c_{K_{Z}}\alpha^{2}p\right)-\exp\left(-\frac{n}{2}\right)-\exp\left(-c_{K_{Z}}t^{2}\right)-\exp\left(-\frac{t^{2}}{2}\right)\\
-2\exp\left(-\frac{\tau\log(n)\left(1-\alpha-C_{K_{Z}}\sqrt{n/p}\right)^{2}}{8\,K_{Z}^{2}\,C^{\prime 2}K_{\varepsilon}^{2}\,\kappa_{n}(A)^{2}}\right)-4\exp\left(-c_{K_{Z}}\alpha p\right)-\exp\left(-cC^{\prime 2}n\right),\tag{38}
\end{multline}

which is the desired result.
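For readers who wish to track the relative size of the terms in the final bound, the short sketch below evaluates them for illustrative values of $n$, $p$, $t$, $\tau$ and of the various (hypothetical) constants. It only illustrates that, for fixed $t$ and $\tau$, the cross term and the residual term vanish as $p$ grows with $n$ fixed, leaving the noise level and the signal term $t\,K_{Z}\|A^{\top}\theta^{*}\|_{2}$.

```python
import numpy as np

# Illustrative, hypothetical constants; the proof only asserts their existence.
K_Z, K_eps, C_ell, C_f, sqrtC, delta, C_KZ = 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 1.0
alpha, t, tau = 0.25, 2.0, 2.0
norm_Atheta = 1.0                       # stands for ||A^T theta*||_2

n = 100
for p in [1_000, 10_000, 100_000]:
    signal = t * K_Z * norm_Atheta
    cross = np.sqrt((n / p) * np.log(n**tau))
    residual = t * 6 * sqrtC * C_ell * C_f * K_eps * np.sqrt(n) / (
        delta * ((1 - alpha) * np.sqrt(p) - C_KZ * np.sqrt(n)))
    print(f"p={p:6d}: signal {signal:.3f}, cross {cross:.3f}, residual {residual:.3f}")
```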

Appendix C Classical bounds on the extreme singular values of finite dimensional random matrices

C.1 Random matrices with independent rows

Recall that the matrix $X$ has as its rows $n$ i.i.d. subGaussian random vectors of $\mathbb{R}^{p}$, with $K_{X}=\max_{i}\|X_{i}\|_{\psi_{2}}$. In the underparametrised case, we have $n>p$. Let us recall the following bound on the singular values of a matrix with independent subGaussian rows.

Theorem C.1.

[17, Theorem 5.39] Let $X$ be an $n\times p$ matrix whose rows $X_{i}$, $i=1,\ldots,n$, are independent subGaussian isotropic random vectors in $\mathbb{R}^{p}$. Then for every $t\geq 0$, with probability at least $1-2\exp(-c_{K_{X}}t^{2})$, one has

\begin{align*}
\sqrt{n}-C_{K_{X}}\sqrt{p}-t\leq s_{n}(X)\leq s_{1}(X)\leq\sqrt{n}+C_{K_{X}}\sqrt{p}+t,
\end{align*}

where $C_{K_{X}},c_{K_{X}}>0$ depend only on the subGaussian norm $K_{X}=\max_{i}\|X_{i}\|_{\psi_{2}}$ of the rows.
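This theorem is easy to visualise numerically: for Gaussian rows (which are subGaussian and isotropic), the extreme singular values of an $n\times p$ matrix concentrate in the window $\sqrt{n}\pm C_{K_{X}}\sqrt{p}\pm t$. The sketch below compares the empirical extremes over repeated draws with the baseline $\sqrt{n}\pm\sqrt{p}$; the sizes and the number of repetitions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, n_mc = 2000, 200, 100          # underparametrised regime: n > p

s_min, s_max = [], []
for _ in range(n_mc):
    X = rng.standard_normal((n, p))  # independent isotropic (sub)Gaussian rows
    s = np.linalg.svd(X, compute_uv=False)
    s_min.append(s[-1])
    s_max.append(s[0])

# Compare the empirical extremes with sqrt(n) -/+ sqrt(p).
print(f"sqrt(n) - sqrt(p) = {np.sqrt(n) - np.sqrt(p):.2f},  min smallest s.v. = {min(s_min):.2f}")
print(f"sqrt(n) + sqrt(p) = {np.sqrt(n) + np.sqrt(p):.2f},  max largest  s.v. = {max(s_max):.2f}")
```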

In our main text, we use the following corollary.

Corollary C.2.

Take $t=\alpha\sqrt{n}$ with $\alpha>0$. Under the same assumptions as in Theorem C.1, with probability at least $1-2\exp(-c_{K_{X}}\alpha^{2}n)$,

\begin{align*}
s_{n}(X)&\geq(1-\alpha)\sqrt{n}-C_{K_{X}}\sqrt{p},\\
s_{1}(X)&\leq(1+\alpha)\sqrt{n}+C_{K_{X}}\sqrt{p}.
\end{align*}

C.2 Random matrices with independent columns

In the overparametrised case, the following theorem of Vershynin will be instrumental.

Theorem C.3.

[17, Theorem 5.58] Let $X$ be an $n\times p$ matrix with $n\leq p$ whose rows $X_{i}$ are independent subGaussian isotropic random vectors in $\mathbb{R}^{p}$ with $\|X_{i}\|_{2}=\sqrt{p}$ almost surely. Then for every $\alpha\geq 0$, with probability at least $1-2\exp(-c_{K_{X}}\alpha\,p)$, one has

\begin{align*}
s_{n}(X^{\top})&\geq(1-\alpha)\sqrt{p}-C_{K_{X}}\sqrt{n},\\
s_{1}(X^{\top})&\leq(1+\alpha)\sqrt{p}+C_{K_{X}}\sqrt{n},
\end{align*}

where $C_{K_{X}},c_{K_{X}}>0$ depend only on the subGaussian norm $K_{X}=\max_{i}\|X_{i}\|_{\psi_{2}}$ of the columns of $X^{\top}$, i.e. of the rows $X_{i}$ of $X$.
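The same kind of numerical sanity check applies in the overparametrised regime. Below, the rows are drawn Gaussian and renormalised so that $\|X_{i}\|_{2}=\sqrt{p}$, as the theorem requires, and the empirical extreme singular values of $X^{\top}$ are compared with $\sqrt{p}\pm\sqrt{n}$; the sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, n_mc = 100, 5000, 100          # overparametrised regime: n <= p

s_min, s_max = [], []
for _ in range(n_mc):
    X = rng.standard_normal((n, p))
    # Renormalise each row to norm sqrt(p); rows stay isotropic and subGaussian.
    X *= np.sqrt(p) / np.linalg.norm(X, axis=1, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)   # singular values of X (= those of X^T)
    s_min.append(s[-1])
    s_max.append(s[0])

print(f"sqrt(p) - sqrt(n) = {np.sqrt(p) - np.sqrt(n):.2f},  min s_n(X^T) = {min(s_min):.2f}")
print(f"sqrt(p) + sqrt(n) = {np.sqrt(p) + np.sqrt(n):.2f},  max s_1(X^T) = {max(s_max):.2f}")
```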