A finite sample analysis of generalisation for minimum norm interpolators in the ridge function model with random design
Abstract
Recent extensive numerical experiments in large-scale machine learning have uncovered a quite counterintuitive phase transition, as a function of the ratio between the sample size and the number of parameters in the model. As the number of parameters approaches the sample size, the generalisation error increases, but surprisingly, it starts decreasing again past this interpolation threshold. This phenomenon, brought to the attention of the theoretical community in [2], has been thoroughly investigated lately, mostly for models simpler than deep neural networks, such as the linear model when the parameter is taken to be the minimum norm solution to the least-squares problem, first in the asymptotic regime where the sample size and the number of parameters tend to infinity, see e.g. [8], and more recently in the finite dimensional regime, specifically for linear models [1], [16], [9]. In the present paper, we propose a finite sample analysis of non-linear models of ridge type, in which we investigate the overparametrised regime of the double descent phenomenon for both the estimation problem and the prediction problem. Our results provide a precise analysis of the distance of the best estimator from the true parameter, as well as a generalisation bound which complements the recent works [1] and [6]. Our analysis is based on tools closely related to the continuous Newton method [14] and on a refined quantitative analysis of the performance in prediction of the minimum $\ell_2$-norm solution.
1 Introduction
The tremendous achievements in deep learning theory and practice have received great attention in the applied Computer Science, Artificial Intelligence and Statistics communities in recent years. Many success stories related to the use of Deep Neural Networks have even been reported in the media, and most data scientists are proficient with the Deep Learning tools available via open-source machine learning libraries such as Tensorflow, Keras, Pytorch and many others.
One of the key ingredients in their success is the huge number of parameters involved in all current architectures. While enabling unprecedented expressive capabilities, such regimes of overparametrisation appear very counterintuitive through the lens of traditional statistical wisdom. Indeed, as intuition suggests, overparametrisation often results in interpolation, i.e. zero training error, and the expected outcome of such endeavours should be very poor generalisation performance. Nevertheless, interpolating deep neural networks still generalise well despite achieving zero or very small training error.
Belkin et al. [2] recently addressed the problem of resolving this paradox, and shed some new light on the complex and perplexing relationship between interpolation and generalisation. In the linear model setting, in the regime where the weights are the minimum norm solution to the least-squares problem, overfitting was proved to be benign under strong design assumptions (i.i.d. random matrices), in the asymptotic regime where the sample size and the dimension tend to infinity, see e.g. [8]. More precise results were recently obtained in the finite dimensional regime, specifically for linear models [1], [16], [9]. Non-linear settings have also been considered, such as the particular instance of kernel ridge regression studied in [3], [12] and [10], where interpolation was proved to coexist with good generalisation. More recent results in the non-linear direction include, e.g., [7] for shallow neural networks, which uses a very general framework combining the properties of gradient descent for a binary classification problem, but under somewhat restrictive assumptions on the dimension of the input. In this paper, we consider a non-linear statistical model, namely a model of the form
\[ y_i = f(x_i^\top \beta^*) + \varepsilon_i, \qquad i = 1, \dots, n, \tag{1} \]
where the $\varepsilon_i$ are noise terms and the function $f$ is assumed to be increasing. This setting is also known as the single index model or the ridge function model. When the estimation of $\beta^*$ is performed by Empirical Risk Minimisation, i.e. by solving
\[ \min_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, f(x_i^\top \beta)\big), \tag{2} \]
for a given smooth loss function $\ell$, we derive an upper bound on the prediction error that gives the precise order of dependence with respect to all the intrinsic parameters of the model, such as the sample size and the dimension, as well as various bounds on the derivatives of $f$ and of the loss function used in the Empirical Risk Minimisation.
Our contribution improves on the literature on the non-asymptotic regime of benign overfitting for non-linear models by providing concentration results on the generalisation error instead of controlling the expected risk only. More precisely, our results quantify the proximity, in Euclidean norm, of a certain solution of (2) to the true parameter $\beta^*$. Our proof of this first result utilises an elegant continuous Newton method argument initially promoted by Neuberger [5], [14]. From this, we obtain a quantitative analysis of the performance in prediction of the minimum $\ell_2$-norm solution, based on a careful study relying on high dimensional probability.
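To fix ideas, the following minimal sketch simulates the model (1) with a strictly increasing link and computes the minimum $\ell_2$-norm interpolator studied in Section 3 via the Moore–Penrose pseudo-inverse. The link $f(t) = t + \tanh(t)$, the Gaussian design, the dimensions and the noise level are our own illustrative choices and are not taken from the theorems below.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 200                        # overparametrised regime: p > n
beta_star = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))           # i.i.d. standard Gaussian (isotropic) design
eps = 0.1 * rng.normal(size=n)        # centred subGaussian noise

f = lambda t: t + np.tanh(t)          # strictly increasing, bijective on R

y = f(X @ beta_star) + eps

# Interpolating f(<x_i, beta>) = y_i is equivalent to <x_i, beta> = f^{-1}(y_i)
# because f is strictly monotonic; invert f by a damped fixed-point iteration,
# which is a contraction here since 1 <= f' <= 2.
def f_inverse(y, n_iter=60):
    t = y.copy()
    for _ in range(n_iter):
        t -= (f(t) - y) / 2.0
    return t

z = f_inverse(y)

# Minimum l2-norm solution of the under-determined linear system X beta = z.
beta_hat = np.linalg.pinv(X) @ z      # equivalently np.linalg.lstsq(X, z, rcond=None)[0]

print("max training residual:", np.max(np.abs(f(X @ beta_hat) - y)))   # ~ 0: interpolation
print("estimation error     :", np.linalg.norm(beta_hat - beta_star))
```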
2 Distance of the true model to the set of empirical risk minimisers
We study a non-isotropic model for the rows of the design matrix, and how the Cholesky factorisation of the covariance matrix can be leveraged to improve the prediction results. Let us now describe our mathematical model.
2.1 Statistical model
In this section, we make general assumptions on the feature vectors.
Assumption 1.
We assume that (1) holds and that the function $f$ and the loss function $\ell$ satisfy the following properties:
• $f$ is strictly monotonic,
• for some positive constant ,
• and is upper bounded by a constant .
Concerning the statistical data, we will make the following set of assumptions
Assumption 2.
We will require that
• the random column vectors with values in are independent random vectors, which can be written as for some matrix in , with at least non-zero singular values, and some independent subGaussian vectors in , with -norm upper bounded by . The vectors are such that the matrix
is full rank with probability one. We define ;
• for all , the random vectors are assumed
  – to have a second moment matrix equal to the identity, i.e.
  – to have -norm equal to (notice that this is different from the usual regression model, where the columns are assumed to be normalised).
Assumption 3.
The errors are assumed to be independent subGaussian centered random variables with -norm upper bounded by .
The performance of the estimators is measured by the theoretical risk given by
In order to estimate $\beta^*$, the Empirical Risk Minimiser is defined as a solution to the following optimisation problem
\[ \hat{\beta} \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \widehat{R}_n(\beta), \tag{3} \]
with
\[ \widehat{R}_n(\beta) = \frac{1}{n} \sum_{i=1}^n \ell\big(y_i, f(x_i^\top \beta)\big). \tag{4} \]
Let us make an important assumption for what is to follow
Assumption 4.
Assume that the loss function $\ell$ and the function $f$ are such that
(5) |
for all in , which is the -ball of radius centered at .
Let us check Assumption 4 in various simple cases.
• In the case of linear regression, we have , , and , , . Thus, we get .
• In the case of the Huber loss and the Softplus function, we have , , and , and . Moreover, . Then, when , we can take as long as for all and all . A numerical check of the corresponding derivative bounds is sketched right after this list.
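The sketch below numerically checks the boundedness of the first two derivatives of the Huber loss and of the Softplus link appearing in the second example; the Huber parameter and the evaluation grid are arbitrary choices, and the exact constants required by (5) are not reproduced.

```python
import numpy as np

delta = 1.0                                  # Huber parameter (illustrative choice)
t = np.linspace(-20, 20, 200001)             # evaluation grid

# First two derivatives of the Huber loss, as functions of the residual t.
huber_d1 = np.where(np.abs(t) <= delta, t, delta * np.sign(t))
huber_d2 = np.where(np.abs(t) <= delta, 1.0, 0.0)

# First two derivatives of the Softplus link f(t) = log(1 + exp(t)).
softplus_d1 = 1.0 / (1.0 + np.exp(-t))            # sigmoid, takes values in (0, 1)
softplus_d2 = softplus_d1 * (1.0 - softplus_d1)   # bounded by 1/4

print("sup |huber'|   :", np.abs(huber_d1).max(), "<= delta =", delta)
print("sup |huber''|  :", huber_d2.max(), "<= 1")
print("sup softplus'  :", softplus_d1.max(), "< 1")
print("sup softplus'' :", softplus_d2.max(), "<= 1/4")
```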
Notation: We denote by $\sigma_k$ the $k$-th singular value of the matrix , in decreasing order, i.e. $\sigma_1 \ge \sigma_2 \ge \cdots$. When $\sigma_n$ is non-zero, we denote by $\kappa_n = \sigma_1/\sigma_n$ its $n$-th condition number.
2.2 Distance to empirical risk minimisers via Neuberger’s theorem
Our main results in this section are the following two theorems. Their proofs are given in Section A.2 and Section A.3 of the appendix.
Theorem 2.1.
(Underparametrised setting)
Let and be positive constants depending only on the subGaussian norm . Let be a real in . Assume that and are such that and that Assumption 4 is verified. Let
(6) |
where , with a positive absolute constant.
Then, with probability at least , the unique solution to the optimisation problem (3) satisfies
where is the usual Euclidean norm.
Theorem 2.2.
(Overparametrised setting)
Remark 2.3.
Assume that (hence ) for simplicity. In order to illustrate the previous results, consider the case where the noise level is of the same order as , i.e. , and is independent of , as in the linear case. In the underparametrised case, this implies that the error bound on grows like . In the overparametrised case, the error bound given by Theorem 2.2 is of order . This is potentially much smaller than which is the natural order for in the absence of sparsity.
Note also that in the different setup of Candès and Plan [4], the rows of would have norm of the order of and the noise would have subgaussian constant independent of the dimensions and . Multiplying by , we get a norm of the ’s of the order and a subgaussian constant of the order for the noise. With this parametrisation of in mind, in the underparametrised case, the bound is of the order of and in the overparametrised case, the error bound given by Theorem 2.2 would be of the order .
2.3 The case of the linear models
2.3.1 The case of linear regression
Let us apply these theorems to the well-known linear regression model. In the linear case, the loss is quadratic and the optimisation problem (3) becomes
We therefore have:
Corollary 2.4 (Underparametrised case: linear model).
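As an informal companion to the linear-model corollaries, the following sketch compares the estimation error of the ordinary least-squares estimator when the sample size exceeds the dimension with that of the minimum $\ell_2$-norm solution in the opposite regime. The dimensions, noise level and normalisation of the true parameter are illustrative choices, not those appearing in the corollaries.

```python
import numpy as np

rng = np.random.default_rng(1)

def estimation_error(n, p, sigma=0.1, n_rep=20):
    errs = []
    for _ in range(n_rep):
        beta_star = rng.normal(size=p) / np.sqrt(p)
        X = rng.normal(size=(n, p))
        y = X @ beta_star + sigma * rng.normal(size=n)
        # lstsq returns the ordinary least-squares solution when n >= p and the
        # minimum l2-norm solution when the system is under-determined (n < p).
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        errs.append(np.linalg.norm(beta_hat - beta_star))
    return float(np.mean(errs))

print("underparametrised (n=400, p=100):", estimation_error(400, 100))
print("overparametrised  (n=100, p=400):", estimation_error(100, 400))
```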
2.4 Discussion of the results
Our results are based on a new zero-finding approach inspired by [14], and we obtain precise quantitative results in the finite sample setting for linear and non-linear models. Following the initial discovery of the “double descent phenomenon” in [2], many authors have addressed the question of precisely characterising the error decay as a function of the number of parameters in the linear and non-linear settings (mostly based on random feature models). Some of the latest works, such as [11], address the problem in the asymptotic regime. Our results provide an explicit control of the distance between certain empirical estimators and the ground truth in terms of the subGaussian norms of the error and the covariance matrix of the covariate vectors. Our analysis employs efficient techniques from [14], combined with various results from Random Matrix Theory and high dimensional probability, as summarised in [17] and extensively discussed in [18]. Theorem 2.1 and Theorem 2.2 provide a new finite sample analysis of the problem of estimating ridge functions in both the underparametrised and overparametrised regimes, i.e. where the number of parameters is smaller (resp. larger) than the sample size.
• Our analysis of the underparametrised setting shows that we can obtain an error of order less than or equal to for all up to an arbitrary fraction of .
• In the overparametrised setting, we get that the error bound approaches the noise level plus as grows to . This term can be small if the rank of is small, and even more so if is concentrated in the null space of or approximately so, which is reminiscent of the results in [16] and the recent work [9].
Similar but simpler bounds also hold in the linear model setting, as presented in Corollary 2.4 and Corollary 2.5.
The following section addresses the problem of studying the prediction error, in probability, of the least $\ell_2$-norm empirical risk minimiser in the overparametrised setting.
3 Generalisation of least norm empirical risk minimisers
In this section, we leverage the results of the previous section in order to study the prediction error of the least $\ell_2$-norm empirical risk minimiser. Our main result is Theorem 3.1, stated in the next subsection.
3.1 Main result
Using the previous results, we obtain the following theorem about benign overfitting in the overparametrised case.
Theorem 3.1.
Assume that is interpolating, i.e. . Let denote the minimum norm interpolating solution to the ERM problem, i.e.
\[ \hat{\beta} \in \operatorname*{arg\,min} \Big\{ \|\beta\|_2 \ : \ f(x_i^\top \beta) = y_i, \ i = 1, \dots, n \Big\}. \tag{9} \]
Under the same assumptions as in Theorem 2.2, for any constant , and for
(10) |
with , where is a positive absolute constant, we have
(11) |
with probability at least
where and are absolute positive constants.
Proof.
See Section B. ∎
Remark 3.2.
For isotropic data, take . Notice that, in accordance with the concept of double descent as discussed in [2], when grows (while staying smaller than in the overparametrised regime), the ratio approaches from the right and the upper bound worsens.
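The qualitative behaviour described in Remark 3.2 can be reproduced numerically. The sketch below tracks the mean squared prediction error of the minimum $\ell_2$-norm interpolator in a linear model with a fixed number of parameters $p$, as the sample size $n$ approaches $p$ from below; the Gaussian design, noise level and dimensions are assumptions of ours, chosen only to illustrate the deterioration of the error near the interpolation threshold.

```python
import numpy as np

rng = np.random.default_rng(2)
p, sigma, n_test, n_rep = 200, 0.5, 1000, 20    # fixed number of parameters p

def prediction_error(n):
    errs = []
    for _ in range(n_rep):
        beta_star = rng.normal(size=p) / np.sqrt(p)
        X = rng.normal(size=(n, p))
        y = X @ beta_star + sigma * rng.normal(size=n)
        beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # min-norm interpolator (n < p)
        X_test = rng.normal(size=(n_test, p))
        errs.append(np.mean((X_test @ (beta_hat - beta_star)) ** 2))
    return float(np.mean(errs))

for n in (20, 50, 100, 150, 180, 195):
    print(f"n = {n:3d}, p/n = {p / n:5.2f}, mean squared prediction error = {prediction_error(n):.3f}")
```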
3.2 Comparison with previous results
Recently, the finite sample analysis of benign overfitting has been addressed in the very interesting works [1], [6] and [9], for the linear model only. These works propose very precise upper and lower bounds on the prediction risk for general covariate covariance matrices, under the subGaussian assumption or the more restrictive Gaussian assumption. In [1], the excess risk is considered and the norm of appears in the proposed upper bound instead of , which is potentially much smaller if has a large component in the kernel of or in spaces where the singular values of are small. The same issue arises in [6], where the noise is assumed neither Gaussian nor subGaussian and can be dependent on the design matrix . In [9], the authors propose a very strong bound that incorporates information more general than . However, the setting studied in [9] is restricted to linear models and Gaussian observation noise. Studies on non-linear models are scarce; a notable exception is [7], where the setting is very different from ours and the number of parameters is constrained in a more rigid fashion than in the present work.
In the present work, we show that similar, very precise results can be obtained for non-linear models of the ridge function class, using elementary perturbation results and some (now) standard random matrix theory. Moreover, our results concern the probability of exceeding a certain error level in predicting a new data point, whereas most results in the literature are restricted to the expected risk.
4 Conclusion and perspectives
This work presents a precise quantitative, finite sample analysis of the benign overfitting phenomenon in the estimation of linear and certain non-linear models. We make use of a zero-finding result of Neuberger [14] which can be applied to a large number of settings in machine learning.
Extending our work to the case of Deep Neural Networks is an exciting avenue for future research. Another possible direction is to include penalisation, which can be addressed using the same techniques via Karush-Kuhn-Tucker conditions. Applying this approach to Ridge Regression and -penalised estimation would bring new proof techniques into the field to investigate potentially deeper results. Weakening the assumptions on our data, which are here of subGaussian type, could also lead to interesting new results; this could be achieved by utilising, e.g. the work of [13].
References
- [1] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
- [2] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
- [3] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
- [4] Emmanuel J Candès and Yaniv Plan. Near-ideal model selection by $\ell_1$ minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.
- [5] Alfonso Castro and JW Neuberger. An inverse function theorem via continuous Newton's method. 2001.
- [6] Geoffrey Chinot and Matthieu Lerasle. Benign overfitting in the large deviation regime. arXiv preprint arXiv:2003.05838, 2020.
- [7] Spencer Frei, Niladri S Chatterji, and Peter Bartlett. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Conference on Learning Theory, pages 2668–2703. PMLR, 2022.
- [8] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
- [9] Guillaume Lecué and Zong Shang. A geometrical viewpoint on the benign overfitting property of the minimum $\ell_2$-norm interpolant estimator. arXiv preprint arXiv:2203.05873, 2022.
- [10] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel “ridgeless” regression can generalize. arXiv preprint arXiv:1808.00387, 2018.
- [11] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
- [12] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
- [13] Shahar Mendelson. Extending the small-ball method. arXiv preprint arXiv:1709.00843, 2017.
- [14] John W Neuberger. The continuous Newton's method, inverse functions, and Nash–Moser. The American Mathematical Monthly, 114(5):432–437, 2007.
- [15] Philippe Rigollet and Jan-Christian Hütter. High dimensional statistics.
- [16] Alexander Tsigler and Peter L Bartlett. Benign overfitting in ridge regression. arXiv preprint arXiv:2009.14286, 2020.
- [17] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- [18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
Appendix A Proofs of Theorem 2.1 and Theorem 2.2
A.1 Neuberger’s theorem
The following theorem of Neuberger [14] will be instrumental in our study of the ERM. In our context, this theorem can be restated as follows.
Theorem A.1 (Neuberger’s theorem for ERM).
Suppose that $r > 0$, that the gradient $\nabla \widehat{R}_n$ is differentiable and that its Jacobian $\nabla^2 \widehat{R}_n$ is a continuous map on the closed ball $\overline{B}(\beta^*, r)$, with the property that for each $\beta$ in $\overline{B}(\beta^*, r)$ there exists a vector $h$ in $\overline{B}(0, r)$ such that
\[ \nabla^2 \widehat{R}_n(\beta)\, h = - \nabla \widehat{R}_n(\beta^*). \tag{12} \]
Then there exists $\hat{\beta}$ in $\overline{B}(\beta^*, r)$ such that $\nabla \widehat{R}_n(\hat{\beta}) = 0$.
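For intuition, the following sketch illustrates the continuous Newton dynamics behind this zero-finding principle on an arbitrary smooth two-dimensional map (not the empirical-risk gradient of the paper): along the flow $\dot{x}(t) = -DF(x(t))^{-1} F(x(t))$ one has $F(x(t)) = e^{-t} F(x(0))$, so $F$ is driven to a zero, which is the mechanism exploited in the continuous Newton method of [14].

```python
import numpy as np

def F(x):                                  # an arbitrary smooth map R^2 -> R^2
    return np.array([x[0] + 0.3 * np.tanh(x[1]) - 1.0,
                     x[1] + 0.3 * np.sin(x[0]) + 0.5])

def DF(x):                                 # its Jacobian
    return np.array([[1.0, 0.3 / np.cosh(x[1]) ** 2],
                     [0.3 * np.cos(x[0]), 1.0]])

x = np.zeros(2)                            # starting point x(0)
dt, T = 1e-2, 10.0                         # explicit Euler discretisation of the flow
for _ in range(int(T / dt)):
    x = x - dt * np.linalg.solve(DF(x), F(x))

# Up to discretisation error, ||F(x(T))|| should match exp(-T) * ||F(x(0))||.
print("||F(x(T))||         :", np.linalg.norm(F(x)))
print("exp(-T) ||F(x(0))|| :", np.exp(-T) * np.linalg.norm(F(np.zeros(2))))
```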
A.1.1 Computing the derivatives
Since the loss is twice differentiable, the empirical risk is itself twice differentiable. The Gradient of the empirical risk is given by
(13) |
Hence, for , and recalling that , we get
(14) |
where is to be understood componentwise, and is a diagonal matrix with coefficients for all in .
The Hessian is given by
(15) |
where is a diagonal matrix given by, for all in
Notice that when the are all non-negative for all , the empirical risk is convex. The condition we have to satisfy in order to use Neuberger’s theorem, i.e. the version of (12) associated with our setting, is the following
A.1.2 A change of variable
Since
(16) |
and , the risk can be written as a function of as follows:
(17) |
with . Clearly, minimizing in is equivalent to minimizing in up to equivalence modulo the kernel of .
(18) | ||||
(19) |
(20) |
where is a diagonal matrix given by, for all in
The Neuberger equation in is now given by
(21) |
where .
A.1.3 A technical lemma
Lemma A.2.
For all , the variable is subGaussian with variance proxy
Proof.
Let us compute
Lipschitzianity of implies that
and since , we get
which implies that
Thus,
as announced. The proof is complete. ∎
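As a sanity check of the derivative computations of Section A.1.1, the following sketch implements the chain-rule gradient and Hessian of an empirical risk of the form $\frac{1}{n} \sum_{i} \ell\big(y_i, f(x_i^\top \beta)\big)$ and verifies them against finite differences. The quadratic loss and the link $f(t) = t + \tanh(t)$ are illustrative choices, and the notation is ours rather than that of (13)–(15).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

f    = lambda t: t + np.tanh(t)                  # illustrative increasing link
f_d1 = lambda t: 1.0 + 1.0 / np.cosh(t) ** 2
f_d2 = lambda t: -2.0 * np.tanh(t) / np.cosh(t) ** 2

loss_d1 = lambda y, v: v - y                     # derivatives of (v - y)^2 / 2 in v
loss_d2 = lambda y, v: np.ones_like(v)

def risk(beta):
    return 0.5 * np.mean((f(X @ beta) - y) ** 2)

def gradient(beta):
    u = X @ beta
    return X.T @ (loss_d1(y, f(u)) * f_d1(u)) / n

def hessian(beta):
    u = X @ beta
    d = loss_d2(y, f(u)) * f_d1(u) ** 2 + loss_d1(y, f(u)) * f_d2(u)
    return X.T @ (d[:, None] * X) / n

beta, eps = rng.normal(size=p), 1e-6
grad_fd = np.array([(risk(beta + eps * e) - risk(beta - eps * e)) / (2 * eps)
                    for e in np.eye(p)])
hess_fd = np.array([(gradient(beta + eps * e) - gradient(beta - eps * e)) / (2 * eps)
                    for e in np.eye(p)])
print("gradient check:", np.max(np.abs(gradient(beta) - grad_fd)))
print("hessian check :", np.max(np.abs(hessian(beta) - hess_fd)))
```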
A.2 Proof of Theorem 2.1: The underparametrised case
A.2.1 Key lemma
Using Corollary C.2, we have as long as
for some positive constant depending on the subGaussian norm of .
Lemma A.3.
With probability larger than or equal to we have
where .
Proof.
We compute the solution of Neuberger’s equation (21), using the Jacobian vector formula in (19) and the Hessian matrix formula in (20):
which gives
The singular value decomposition of gives , where is a matrix whose columns form an orthonormal family, is an orthogonal matrix and is a diagonal and invertible matrix. Thus, we obtain the equivalent equation
Note that any solution of the system
will be admissible. Hence,
Then, as ,
(22) |
Using maximal inequalities, we can now bound the deviation probability of
with . For this purpose, we first prove that is a subGaussian vector and provide an explicit upper bound on its norm. Since is a matrix whose columns form an orthonormal family, is a unit-norm vector of size which is denoted by . Then,
and since is centered and subGaussian for all , Proposition 2.6.1 in Vershynin [18] gives
for some absolute constant . Since
we have
As , we get that for all with ,
We deduce that is a subGaussian random vector with variance proxy
Using the maximal inequality from [15, Theorem 1.19] and the subGaussian properties described in [18, Proposition 2.5.2], for any , we get with probability
(23) |
Following Lemma A.2, we deduce that is a subGaussian random variable with variance proxy
Then, restricting a priori to lie in the ball , which is consistent with the assumptions of Neuberger's Theorem A.1, and using Assumption 4, together with the boundedness of , we get
Subsequently, the quantity (23) becomes
with probability . Taking , equation (22) yields
with probability . ∎
A.2.2 End of the proof of Theorem 2.1
A.3 Proof of Theorem 2.2: The overparametrised case
A.3.1 Key lemma
Using Theorem C.3, we have as long as
for some positive constant depending on the subGaussian norm of .
Lemma A.4.
With probability larger than or equal to , we have
where .
Proof.
As in the underparametrised case, we have to solve (12) i.e
which can be solved by finding the least norm solution of the interpolation problem
(24) |
i.e.
where is the Moore-Penrose pseudo-inverse of .
Given the compact SVD of , where and with orthonormal columns, we get
We then have
As for all ,
Then, using the bound
with probability , which can be obtained in the same way as (23) for the underparametrised case, and taking , we get
(25) |
with probability . ∎
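The least-norm solution used in this proof can be made concrete as follows: for a matrix with full row rank, the Moore–Penrose pseudo-inverse computed from the compact SVD coincides with the generic pseudo-inverse and returns the minimum $\ell_2$-norm solution of the interpolation system. The dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 100
A = rng.normal(size=(n, p))              # full row rank with probability one
b = rng.normal(size=n)

U, s, Vt = np.linalg.svd(A, full_matrices=False)     # compact SVD
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T               # Moore-Penrose pseudo-inverse

print("matches np.linalg.pinv:", np.allclose(A_pinv, np.linalg.pinv(A)))

h = A_pinv @ b
print("solves A h = b        :", np.allclose(A @ h, b))

# Any other solution h + v, with v a non-zero element of ker(A), has larger norm,
# since h lies in the row space of A and is therefore orthogonal to ker(A).
v = rng.normal(size=p)
v -= A_pinv @ (A @ v)                    # project v onto the kernel of A
print("min-norm property     :", np.linalg.norm(h) < np.linalg.norm(h + v))
```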
A.3.2 End of the proof of Theorem 2.2
Appendix B Proof of Theorem 3.1
B.1 First part of the proof
Since and are two interpolating minimisers of the empirical risk, i.e. satisfy and for all in , using that $f$ is strictly monotonic we get
(27) |
B.2 Second part of the proof
B.3 Third part of the proof: Block pseudo-inverse computation
Since is the minimum $\ell_2$-norm solution to the under-determined system
then we have
This gives
(30) |
since
and
B.4 Fourth part of the proof
B.5 Fifth part of the proof
In this section, we control each term on the RHS of (31).
Control of
We have
(32) |
Control of
Let us compute an upper bound on the norm of which holds with large probability. We have
Our main tool will be the following result from Vershynin’s book [18, Exercise 6.3.5]
(33) |
where and are absolute positive constants.
Let us introduce the Singular Value Decomposition of
with a matrix, where is the rank of . We then have
Then we have
and since for some matrix :
where denotes the largest singular value. Since the columns of form an othornomal family, we get
Hence
Since is symmetric, we have
Define the event
Let us define
then
Notice that
which gives
which finally yields
Let us turn to the study of . Conditioning on , we have
On the other hand, the subGaussianity of ,
which gives
Then
with the -th condition number.
And since
we get
Choosing for some , we get
(35) | ||||
(36) |
B.6 Final step of the proof
with probability at least
(37) |
Using -Lipschitzianity of , we obtain
with probability at least
(38) |
which is the desired result.
Appendix C Classical bounds on the extreme singular values of finite dimensional random matrices
C.1 Random matrices with independent rows
Recall that the matrix is composed of i.i.d. subGaussian random vectors in , with . In the underparametrised case, we have . Let us recall the following bound on the singular values of a matrix with independent subGaussian rows.
Theorem C.1.
[17, Theorem 5.39] Let $A$ be an $N \times n$ matrix whose rows $A_i$, $i = 1, \dots, N$, are independent subGaussian isotropic random vectors in $\mathbb{R}^n$. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-c t^2)$, one has
\[ \sqrt{N} - C\sqrt{n} - t \ \le\ s_{\min}(A) \ \le\ s_{\max}(A) \ \le\ \sqrt{N} + C\sqrt{n} + t, \]
where $C$ and $c$ depend only on the subGaussian norm $K = \max_i \|A_i\|_{\psi_2}$ of the rows.
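A quick empirical illustration of this bound, in the special case of standard Gaussian rows (one instance of subGaussian isotropic rows) and arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
N, n = 4000, 400
A = rng.normal(size=(N, n))              # independent isotropic (Gaussian) rows
s = np.linalg.svd(A, compute_uv=False)

print("sqrt(N) - sqrt(n) =", np.sqrt(N) - np.sqrt(n))
print("s_min(A)          =", s.min())
print("s_max(A)          =", s.max())
print("sqrt(N) + sqrt(n) =", np.sqrt(N) + np.sqrt(n))
```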
In the main text, we use the following corollary.
Corollary C.2.
Let us suppose that with . Under the same assumptions as in Theorem C.1, with probability equal to or larger than
C.2 Random matrices with independent columns
In the overparametrised case, the following theorem of Vershynin will be instrumental.
Theorem C.3.
[17, Theorem 5.58] Let $A$ be an $N \times n$ matrix ($N \ge n$) whose columns $A_j$ are independent subGaussian isotropic random vectors in $\mathbb{R}^N$ with $\|A_j\|_2 = \sqrt{N}$ almost surely. Then for every $t \ge 0$, with probability at least $1 - 2\exp(-c t^2)$, one has
\[ \sqrt{N} - C\sqrt{n} - t \ \le\ s_{\min}(A) \ \le\ s_{\max}(A) \ \le\ \sqrt{N} + C\sqrt{n} + t, \]
where $C$ and $c$ depend only on the subGaussian norm $K = \max_j \|A_j\|_{\psi_2}$ of the columns.