Optimal Excess Risk Bounds for Empirical Risk Minimization on $p$-Norm Linear Regression
Abstract
We study the performance of empirical risk minimization on the $p$-norm linear regression problem for $p \in (1, \infty)$. We show that, in the realizable case, under no moment assumptions, and up to a distribution-dependent constant, $O(d)$ samples are enough to exactly recover the target. Otherwise, for $p \in [2, \infty)$, and under weak moment assumptions on the target and the covariates, we prove a high probability excess risk bound on the empirical risk minimizer whose leading term matches, up to a constant that depends only on $p$, the asymptotically exact rate. We extend this result to the case $p \in (1, 2)$ under mild assumptions that guarantee the existence of the Hessian of the risk at its minimizer.
1 Introduction
Real-valued linear prediction is a fundamental problem in machine learning. Traditionally, the square loss has been the default choice for this problem. The performance of empirical risk minimization (ERM) on linear regression under the square loss, as measured by the excess risk, has been studied extensively both from an asymptotic [Whi82, LC83, Vaa98] and a non-asymptotic point of view [AC11, HKZ12, Oli16, LM16, Sau18, Mou22]. An achievement of the last decade has been the development of non-asymptotic excess risk bounds for ERM on this problem under weak assumptions, and which match, up to constant factors, the asymptotically exact rate.
In this paper, we consider the more general family of $p$-th power losses for a user-chosen $p \in (1, \infty)$. Under mild assumptions, the classical asymptotic theory can still be applied to ERM under these losses, yielding the asymptotic distribution of the excess risk. However, to the best of our knowledge, the problem of deriving non-asymptotic excess risk bounds for ERM for $p \neq 2$ remains open, and, as we discuss below, resists the application of standard tools from the literature.
Our motivation for extending the case $p = 2$ to $p \in (1, \infty)$ is twofold. Firstly, the freedom in the choice of $p$ allows us to better capture our prediction goals. For example, we might only care about how accurate our prediction is on average, in which case the choice $p = 1$ is appropriate. At the other extreme, we might insist that we do as well as possible on a subset of inputs of probability $1$, in which case the choice $p = \infty$ is best. A choice of $p \in (1, \infty)$ therefore allows us to interpolate between these two extremes, with the case $p = 2$ offering a balanced choice. Secondly, different choices of $p$ have complementary qualities. On the one hand, small values of $p$ allow us to operate with weak moment assumptions, making them applicable in more general cases. On the other, larger values of $p$ yield predictions whose optimality is less sensitive to changes in the underlying distribution: for $p = \infty$, the best predictor depends only on the support of this distribution.
To sharpen our discussion, let us briefly formalize our problem. There is an input random vector $X \in \mathbb{R}^d$ and an output random variable $Y \in \mathbb{R}$, and we are provided with $n$ i.i.d. samples $(X_1, Y_1), \dots, (X_n, Y_n)$ of $(X, Y)$. We select our set of predictors to be the class of linear functions $\{x \mapsto \langle w, x \rangle : w \in \mathbb{R}^d\}$, and choose a value $p \in (1, \infty)$ with the corresponding loss $\ell_p(t) := |t|^p$. Using this loss, we define the associated risk and empirical risk, respectively, by
$$R(w) := \mathbb{E}\bigl[\ell_p(\langle w, X \rangle - Y)\bigr], \qquad \hat{R}_n(w) := \frac{1}{n}\sum_{i=1}^{n} \ell_p(\langle w, X_i \rangle - Y_i).$$
We perform empirical risk minimization $\hat{w}_n \in \operatorname{argmin}_{w \in \mathbb{R}^d} \hat{R}_n(w)$, and our goal is to derive a high probability bound on the excess risk $R(\hat{w}_n) - R(w^*)$, where $w^* \in \operatorname{argmin}_{w \in \mathbb{R}^d} R(w)$ is the risk minimizer. For efficient algorithms for computing an empirical risk minimizer $\hat{w}_n$, we refer the reader to the rich recent literature dealing with this problem [BCLL18, AKPS19, APS19, JLS22].
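For illustration, here is a minimal Python sketch of the estimator we analyze, computed with a generic convex solver on synthetic data; the loss is taken to be $|\langle w, x \rangle - y|^p$, and the function name `erm_p_norm` and the toy data are ours rather than part of the formal setup.

```python
import numpy as np
from scipy.optimize import minimize

def erm_p_norm(X, y, p):
    """Empirical risk minimizer for the p-th power loss |<w, x> - y|^p.

    The empirical risk is convex in w for p > 1, so a quasi-Newton solver
    warm-started at the least-squares solution suffices for this sketch.
    """
    n, d = X.shape
    def risk(w):
        return np.mean(np.abs(X @ w - y) ** p)
    def grad(w):
        r = X @ w - y
        return (p / n) * ((np.sign(r) * np.abs(r) ** (p - 1)) @ X)
    w0 = np.linalg.lstsq(X, y, rcond=None)[0]   # p = 2 solution as warm start
    return minimize(risk, w0, jac=grad, method="L-BFGS-B").x

# Toy usage: heavy-tailed noise; p = 1.5 is less sensitive to outliers than p = 2.
rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + rng.standard_t(df=3, size=n)
w_hat = erm_p_norm(X, y, p=1.5)
print(np.linalg.norm(w_hat - w_star))
```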
To see why the problem we are considering is difficult, let us briefly review some of the recent literature. Most closely related to our problem are the results of [AC11, HKZ12, Oli16, LM16], who derive high probability non-asymptotic excess risk bounds for the case $p = 2$. The best such bounds are found in [Oli16] and [LM16], who both operate under weak assumptions on the distribution of $(X, Y)$, requiring at most the existence of fourth moments of $Y$ and of the components of $X$. Unfortunately, the analysis in [Oli16] relies on the closed form expression of the empirical risk minimizer $\hat{w}_n$, and therefore cannot be extended to other values of $p$. Similarly, the analysis in [LM16] relies on an exact decomposition of the excess loss in terms of “quadratic” and “multiplier” components, which also does not extend to other values of $p$.
To address these limitations, the work of [Men18] extends the ideas of [Men14, LM16] to work for loss functions more general than the square loss. Roughly speaking, the main result of [Men18] states that as long as the loss is strongly convex and smooth in a neighbourhood of the minimizer, the techniques developed by [Men14] can still be applied to obtain high probability excess risk bounds. Unfortunately, the loss functions $\ell_p$ are particularly ill-behaved in precisely this sense: their second derivative vanishes as its argument tends to $0$ for $p > 2$, and as it tends to infinity for $p \in (1, 2)$. This makes the analysis of the excess risk of ERM in the case $p \neq 2$ particularly challenging using well-established methods.
Contrary to the non-asymptotic regime, the asymptotic properties of the excess risk of ERM under the $p$-th power losses are better understood [Ron84, BRW92, Nie92, Arc96, HS96, LL05], and can be derived from the more general classical asymptotic theory of $M$-estimators [LC83, VW96, Vaa98] under mild regularity conditions. In particular, these asymptotic results imply that the excess risk of ERM with $n$ samples satisfies
(1) |
where $H$ is the Hessian of the risk at its minimizer. We refer the reader to the discussions in [OB21, MG22] for more details. As we demonstrate in Theorem 1, the rate of convergence of ERM for the square loss derived in [Oli16] and [LM16] matches the asymptotic rate (1) up to a constant factor. Ideally, we would like our excess risk bounds for the cases $p \neq 2$ to also match the asymptotic rate (1), although it is not yet clear how to derive any meaningful such bounds.
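For readers less familiar with the classical theory, the following display sketches the standard expansion that (1) refers to, written in our own notation: $H := \nabla^2 R(w^*)$, and $g(X, Y)$ denotes the gradient of the loss at the minimizer $w^*$ evaluated at the pair $(X, Y)$; see [Vaa98, Chapter 5] for the precise regularity conditions.

```latex
% Classical M-estimation asymptotics (sketch, under regularity conditions):
\sqrt{n}\,\bigl(\hat{w}_n - w^*\bigr)
  \xrightarrow{\;d\;} \mathcal{N}\!\bigl(0,\; H^{-1}\operatorname{Cov}\bigl(g(X,Y)\bigr)H^{-1}\bigr),
\qquad
n\,\bigl(R(\hat{w}_n) - R(w^*)\bigr)
  \xrightarrow{\;d\;} \tfrac{1}{2}\,\|G\|_{H^{-1}}^{2},
\quad G \sim \mathcal{N}\!\bigl(0,\; \operatorname{Cov}\bigl(g(X,Y)\bigr)\bigr),
% so that the excess risk concentrates at scale
% \operatorname{tr}\bigl(H^{-1}\operatorname{Cov}(g(X,Y))\bigr) / (2n).
```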
In this paper, we prove the first high probability non-asymptotic excess risk bounds for ERM under the $p$-th power losses for any $p \in (1, \infty)$. Our assumptions on the distribution of $(X, Y)$ are weak, arise naturally from the analysis, and reduce to the standard ones for the case $p = 2$. Furthermore, the rate we derive matches, up to a constant that depends only on $p$, the asymptotically exact rate (1).
We split the analysis in three cases. The first is when the problem is realizable, i.e. $Y = \langle w^*, X \rangle$ almost surely for some $w^* \in \mathbb{R}^d$. This edge case is not problematic for the analysis of the case $p = 2$, but as discussed above, the losses $\ell_p$ are ill-behaved around $0$ for $p \neq 2$, requiring us to treat this case separately. The second case is when the problem is not realizable and $p \in [2, \infty)$. The final case is when the problem is not realizable and $p \in (1, 2)$, which turns out to be the most technically challenging. In Section 2, we present our main results and in Section 3, we provide their proofs.
Notation. We denote the components of the random vector $X$ by $X^j$ for $j \in \{1, \dots, d\}$. We assume the support of $X$ is not contained in any hyperplane, i.e. $\langle w, X \rangle = 0$ almost surely only if $w = 0$. This is without loss of generality as discussed in [Oli16, Mou22]. For a positive semi-definite matrix $M$, we denote the bilinear form it induces on $\mathbb{R}^d$ by $\langle u, v \rangle_M := \langle u, M v \rangle$, and define $\|u\|_M := \sqrt{\langle u, u \rangle_M}$. We define .
2 Main results
In this section, we state our main results. We start in Section 2.1, where we introduce constants that help us formulate our theorems. In Section 2.2, we state the best known results for both the case $p = 2$ and the realizable case where $Y = \langle w^*, X \rangle$ almost surely. Finally, in Section 2.3, we state our theorems.
2.1 Norm equivalence and small ball constants
To state our results, we will need to define two types of quantities first. The first kind are related to norms and their equivalence constants, which we will use in the analysis of the non-realizable case. The second are small ball probabilities, which we will use in the analysis of the realizable case.
We start by introducing the following functions on our space of coefficients . For , define, with the convention for all ,
(2) |
As suggested by the notation, under appropriate moment assumptions on , these functions are indeed norms on . In that case, we will be interested in norm equivalence constants between them
(3) |
where and stand for one of or . Let us note that since we work in a finite dimensional vector space, all norms are equivalent, so that as soon as the quantities defined in (2) are indeed norms, the constants defined in (3) are finite. Furthermore, as suggested by the notation, may be viewed as the maximum second moment of the random variables over the unit sphere of . Finally, we record the following identities for future use
(4) |
The first identity holds by linearity, and the second by noticing that .
We now turn to small ball probabilities. We define the following functions on , for ,
(5) |
Assumptions on the functions and have been used extensively in the recent literature, see e.g. [Men14, KM15, LM17, LM17a, Men18, LM18, Mou22]. In particular, a standard assumption postulates the existence of strictly positive constants , and such that and for all . Conditions of this type are usually referred to as small ball conditions. Efforts have been made to understand when these conditions hold [Men14, RV15, LM17a] as well as reveal the dimension dependence of the constants with which they do [Sau18]. Here we prove that such conditions always hold for finite dimensional spaces. We leave the proof of Lemma 1 to Appendix B to not distract from our main development.
Lemma 1.
is upper semi-continuous. Furthermore, if for some , for all , then is lower semi-continuous for any . Moreover, for all
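To make these quantities concrete, the following Monte Carlo sketch estimates a small-ball function of the standard form $\inf_{\|w\|_2 = 1} \mathbb{P}\bigl(|\langle w, X \rangle| \geq \beta\bigr)$ by restricting the infimum to a finite random net of directions; this particular form, and the names below, are illustration choices of ours rather than the definitions in (5).

```python
import numpy as np

def small_ball_estimate(sample_X, beta, n_directions=2000, seed=0):
    """Monte Carlo sketch of inf_{||w||_2 = 1} P(|<w, X>| >= beta),
    approximating the infimum over the sphere by a finite random net."""
    rng = np.random.default_rng(seed)
    d = sample_X.shape[1]
    W = rng.standard_normal((n_directions, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit directions
    margins = np.abs(sample_X @ W.T)                # |<w, X_i>| per direction
    return (margins >= beta).mean(axis=0).min()     # worst direction on the net

# Toy usage: for standard Gaussian X the true value is 2 * P(N(0, 1) >= beta).
rng = np.random.default_rng(1)
X = rng.standard_normal((20000, 3))
print(small_ball_estimate(X, beta=0.5))
```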
2.2 Background
To better contextualize our results, we start by stating the best known high probability bound on ERM for the square loss, which we deduce from [Oli16] and [LM16].
Theorem 1 (Theorem 4.2, [Oli16]; Theorem 1.3, [LM16]).
Assume that and for all , and let . If
then, with probability at least
Up to a constant factor and the dependence on the confidence parameter, Theorem 1 recovers the asymptotically exact rate (1). Let us briefly comment on the differences between Theorem 1 and the comparable statements in the original papers. First, the finiteness of the norm equivalence constant is deduced from the finiteness of the fourth moments of the components of $X$, instead of being assumed as in [Oli16] (see the discussion in Section 3.1 in [Oli16]). Second, we combine Theorem 3.1 from [Oli16] with the proof technique of [LM16] to achieve a slightly better bound than the one achieved by the proof technique used in the proof of Theorem 4.2 in [Oli16], while avoiding the dependence on the small-ball constant present in the bound of Theorem 1.3 in [LM16], which is known to incur additional dimension dependence in some cases [Sau18].
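For intuition, the following small simulation (ours; it assumes a well-specified Gaussian model, which is stronger than the assumptions of Theorem 1) illustrates the $d/n$ scaling of the excess risk of least squares.

```python
import numpy as np

def mean_excess_risk(n, d, sigma=1.0, trials=200, seed=0):
    """Average excess risk of OLS under y = <w*, x> + sigma * N(0, 1), x ~ N(0, I_d).

    For the square loss, R(w) - R(w*) = ||w - w*||_Sigma^2 with Sigma = I_d here,
    and classical theory predicts a mean of roughly sigma^2 * d / n."""
    rng = np.random.default_rng(seed)
    excess = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        w_star = rng.standard_normal(d)
        y = X @ w_star + sigma * rng.standard_normal(n)
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        excess.append(np.sum((w_hat - w_star) ** 2))
    return np.mean(excess)

for n in (200, 400, 800):
    print(n, mean_excess_risk(n, d=10), 10 / n)   # empirical vs. sigma^2 * d / n
```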
We now move to the realizable case, where so that for all . We immediately note that Theorem 1 is still applicable in this case, and ensures that we recover exactly with no more than samples. However, we can do much better, while getting rid of all the moment assumptions in Theorem 1. Indeed, it is not hard to see that only if for some , for all (taking works). The implicit argument in Theorem 1 then uses the pointwise bound (see Lemma B.2 in [Oli16])
and uniformizes it over the unit sphere in , where the norm is as defined in (2). However, we can use the much tighter bound where is as defined in Lemma 1. To the best of our knowledge, the realizable case has not been studied explicitly before in the literature. However, with the above considerations in mind, we can deduce the following result from [LM17a], which uniformizes the pointwise bound we just discussed using a VC dimension argument.
Theorem 2 (Corollary 2.5, [LM17a]).
Assume that there exists such that . Let . If
then for any , with probability at least .
2.3 Results
We are now in a position to state our main results. As discussed in Section 1, the second derivatives of the losses $\ell_p$ are ill-behaved at $0$. When the problem is realizable, the risk is therefore not twice differentiable at its minimizer for the cases $p \in (1, 2)$, and its Hessian at the minimizer is degenerate for the cases $p \in (2, \infty)$. If we want bounds of the form (1), we must exclude this case from our analysis. Our first main result is a strengthening of Theorem 2, and relies on a combinatorial argument to uniformize the pointwise estimate discussed in Section 2.2.
Theorem 3.
Assume that there exists such that . Then for all , and for all , we have
where is as defined in Lemma 1. Furthermore, if
then with probability at least , .
Comparing Theorem 2 and Theorem 3, we see that the bound on the number of samples required to reach a confidence level in Theorem 3 is uniformly smaller than the one in Theorem 2. The proof of Theorem 3 can be found in Appendix C.
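For illustration, the following toy experiment (ours) uses a Cauchy design, which has no finite moments, to emphasize that exact recovery in the realizable case is a rank event rather than a moment phenomenon: once the design matrix has full column rank, the only vector with zero empirical risk is the target, regardless of $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
w_star = rng.standard_normal(d)

for n in (d, 2 * d, 4 * d):
    X = rng.standard_cauchy((n, d))   # heavy tails: no moments exist
    y = X @ w_star                    # realizable: Y = <w*, X> exactly
    if np.linalg.matrix_rank(X) == d:
        # Any empirical risk minimizer has zero empirical risk, hence solves
        # X w = y, and full column rank forces w = w*; least squares finds it.
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        print(n, np.allclose(w_hat, w_star))
    else:
        print(n, "rank-deficient design: exact recovery not guaranteed")
```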
We now move to the more common non-realizable case. Our first theorem here gives a non-asymptotic bound for the excess risk of ERM under a $p$-th power loss for $p \in [2, \infty)$. To the best of our knowledge, no such result is known in the literature.
Theorem 4.
Let and . Assume that no satisfies . Further, assume that , , and for all . If
then with probability at least
where we used to denote as defined in (3).
Up to a constant factor that depends only on $p$ and the dependence on the confidence parameter, the bound of Theorem 4 is precisely of the form of the optimal bound (1). Indeed, as $n \to \infty$, the second term is $o(1/n)$. At the level of assumptions, the finiteness of the $p$-th moment of $Y$ and of the components of $X$ is necessary to ensure that the risk is finite for all $w$. The last assumption is a natural extension of the fourth moment assumption in Theorem 1. In fact, all three assumptions in Theorem 4 reduce to those of Theorem 1 as $p \to 2$. It is worth noting that the constant has the alternative expression given by (4), i.e. it is the norm equivalence constant between the norm and the norm induced by . Using again (4), we can see how this constant behaves as $p \to 2$. As $p$ grows, so does this constant, and we suspect in a dimension dependent way. However, this does not affect the asymptotic optimality of our rate, as the constant only enters an $o(1/n)$ term in our bound.
We now turn to the case of $p \in (1, 2)$, where we need a slightly stronger version of non-realizability.
Theorem 5.
Let and . Assume that and . Further, assume that , , for all . If
then, with probability at least
where we used to denote and where .
Just like the bounds of Theorems 1 and 4, the bound of Theorem 5 is asymptotically optimal up to a constant factor that depends only on $p$. Indeed, as $n \to \infty$, the second term is $o(1/n)$. At the level of assumptions, we have two additional conditions compared to Theorem 4. First, we require the existence of the second moment of the covariates instead of just the $p$-th moment. Second, we require a stronger version of non-realizability by assuming the existence of a negative moment of the residual. In the majority of applications, an intercept variable is included as a covariate, so that this negative moment assumption is already implied by the standard non-realizability assumption. In the rare case where an intercept variable is not included, a negative moment assumption of the same kind can be used instead, at the cost of a larger constant factor in the $o(1/n)$ term.
Finally, it is worth noting that for the cases $p \in (1, 2)$, there are situations where the asymptotic bound (1) does not hold, as the limiting distribution of the coefficients as $n \to \infty$ is not necessarily Gaussian, and depends heavily on the distribution of the noise, see e.g. [LL05] and [Kni98]. Overall, we suspect that perhaps a slightly weaker version of our assumptions is necessary for a fast rate like (1) to hold.
3 Proofs
3.1 Proof of Theorem 1
Here we give a detailed proof of Theorem 1. While the core technical result can be deduced by combining results from [Oli16] and [LM16], here we frame the proof in a way that makes it easy to extend to the cases $p \neq 2$, and differently from either paper. We split the proof into three steps. First notice that since the loss is a quadratic function of $w$, we can express it exactly using a second order Taylor expansion around the minimizer
Taking empirical averages and expectations of both sides respectively shows that the excess empirical risk and excess risk also admit such an expansion
(6) |
where in the second equality we used that the gradient of the risk vanishes at the minimizer . Therefore, to bound the excess risk, it is sufficient to bound the norm . This is the goal of the second step, where we use two ideas. First, by definition, the excess empirical risk of the empirical risk minimizer satisfies the upper bound
(7) |
Second, we use the Cauchy-Schwarz inequality to lower bound the excess empirical risk by
(8) |
and we further lower bound it by deriving high probability bounds on the two random terms appearing in (8). The first can easily be bounded using Chebyshev’s inequality and the elementary fact that the variance of the average of $n$ i.i.d. random variables is the variance of their common distribution divided by $n$. Here we state the result for all $p \in (1, \infty)$; the straightforward proof can be found in Appendix D.
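Before stating it, let us record the steps so far in our notation, with $H := \nabla^2 R(w^*)$ and $\hat{H}_n := \nabla^2 \hat{R}_n(w^*)$ (a sketch of (6)–(8); the exactness of the expansions uses only that the loss is quadratic in $w$):

```latex
R(w) - R(w^*) \;=\; \tfrac{1}{2}\,\|w - w^*\|_{H}^{2},
\qquad
\hat{R}_n(w) - \hat{R}_n(w^*) \;=\; \langle \nabla \hat{R}_n(w^*),\, w - w^*\rangle
  + \tfrac{1}{2}\,\|w - w^*\|_{\hat{H}_n}^{2},
\\[4pt]
\hat{R}_n(\hat{w}_n) - \hat{R}_n(w^*) \;\le\; 0,
\qquad
\hat{R}_n(\hat{w}_n) - \hat{R}_n(w^*)
  \;\ge\; -\,\|\nabla \hat{R}_n(w^*)\|_{H^{-1}}\,\|\hat{w}_n - w^*\|_{H}
  + \tfrac{1}{2}\,\|\hat{w}_n - w^*\|_{\hat{H}_n}^{2}.
```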
Lemma 2.
Let . If , let the assumptions of Theorem 5 hold. Then with probability at least
For the second random term , we use Theorem 3.1 of [Oli16], which we restate here, emphasizing that the existence of fourth moments of the components of the random vector is enough to ensure the existence of the needed norm equivalence constant.
Proposition 1 (Theorem 3.1, [Oli16]).
Let be a random vector satisfying for all and assume that only if . For and , define
Let be i.i.d. samples of . Then, with probability at least , for all ,
Using this result we can immediately deduce the required high probability lower bound on the second random term ; we leave the obvious proof to Appendix D.
Corollary 1.
Under the assumptions of Theorem 1, if , then with probability at least , for all ,
3.2 Proof Sketch of Theorem 4
The main challenge in moving from the case $p = 2$ to the case $p \in (2, \infty)$ is that the second order Taylor expansion of the loss is no longer exact. The standard way to deal with this problem is to assume that the loss is upper and lower bounded by quadratic functions, i.e. that it is smooth and strongly convex. Unfortunately, as discussed in Section 1, the loss $\ell_p$ is not strongly convex for any $p > 2$, so we need to find another way to deal with this issue. Once this has been resolved, however, the strategy we used in the proof of Theorem 1 can be applied almost verbatim to yield the result. Remarkably, a result of [AKPS22] allows us to upper and lower bound the $p$-th power loss for $p \geq 2$ by its second order Taylor expansion around a point, up to some residual terms. An application of this result yields the following Lemma.
Lemma 3.
Let . Then:
(10)
(11)
Up to constant factors that depend only on $p$ and a residual term, Lemma 3 gives matching upper and lower bounds on the excess risk and excess empirical risk in terms of their second order Taylor expansions around the minimizer. We can thus use the approach taken in the proof of Theorem 1 to obtain the result. The only additional challenge is the control of the residual term, which we achieve using norm equivalence. A detailed proof of Theorem 4, including the proof of Lemma 3, can be found in Appendix E.
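The shape of these bounds can be probed numerically. The sketch below (ours; it does not reproduce the exact constants of Lemma 2.5 in [AKPS22]) estimates, over a grid, the ratio between the linearization gap of $|\cdot|^p$ and the reference quantity $|x|^{p-2}\delta^2 + |\delta|^p$ for a few values of $p \geq 2$, which is the kind of two-sided comparison Lemma 3 relies on.

```python
import numpy as np

def linearization_gap(x, delta, p):
    """|x + delta|^p minus its first-order Taylor expansion at x."""
    return (np.abs(x + delta) ** p
            - np.abs(x) ** p
            - p * np.sign(x) * np.abs(x) ** (p - 1) * delta)

def sandwich_ratios(p, grid_size=400):
    """Empirical range of gap / (|x|^{p-2} delta^2 + |delta|^p) over a grid.

    For p >= 2 the ratio should stay within constants depending only on p."""
    xs = np.linspace(-5.0, 5.0, grid_size)
    ds = np.linspace(-5.0, 5.0, grid_size)
    X, D = np.meshgrid(xs[xs != 0], ds[ds != 0])   # avoid the trivial 0/0 point
    ref = np.abs(X) ** (p - 2) * D ** 2 + np.abs(D) ** p
    ratio = linearization_gap(X, D, p) / ref
    return ratio.min(), ratio.max()

for p in (2.0, 3.0, 4.0):
    print(p, sandwich_ratios(p))   # for p = 2 the ratio is identically 1/2
```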
3.3 Proof Sketch of Theorem 5
The most technically challenging case is when $p \in (1, 2)$. Indeed, as seen in the proof of Theorem 1, the most involved step is lower bounding the excess empirical risk with high probability. For the case $p \geq 2$, we achieved this by having access to a pointwise quadratic lower bound, which is not too surprising. Indeed, at small scales, we expect the second order Taylor expansion to be accurate, while at large scales, we expect the $p$-th power loss to grow at least quadratically for $p \geq 2$.
In the case of $p \in (1, 2)$, we are faced with a harder problem. Indeed, as $p \to 1$, the losses behave almost linearly at large scales. This means that we cannot expect to obtain a global quadratic lower bound as for the case $p \geq 2$, so we will need a different proof technique. Motivated by related concerns, [BCLL18] introduced the following approximation to the $p$-th power function
for and with . This function was further studied by [AKPS19], who showed in particular that, for any choice of its arguments, it is, up to constants that depend only on $p$, equal to the gap between the $p$-th power function and its linearization; see Lemma 4.5 in [AKPS19] for the precise statement. We use this result to derive the following Lemma.
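Before stating it, for concreteness, the sketch below implements one common form of this approximation as it appears in the iterative refinement literature (our transcription; the normalization in [BCLL18, AKPS19] may differ by factors depending only on $p$), and compares it numerically to the linearization gap of $|\cdot|^p$ for a value of $p \in (1, 2)$.

```python
import numpy as np

def gamma_p(x, t, p):
    """One common form of the smoothed p-th power approximation:
    gamma_p(x; t) = (p/2) * t**(p - 2) * x**2    if |x| <= t,
                    |x|**p + (p/2 - 1) * t**p    if |x| >  t,
    which is continuous at |x| = t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    quad = 0.5 * p * t ** (p - 2) * x ** 2
    tail = np.abs(x) ** p + (0.5 * p - 1.0) * t ** p
    return np.where(np.abs(x) <= t, quad, tail)

def linearization_gap(s, x, p):
    """|s + x|^p minus its first-order Taylor expansion at s."""
    return (np.abs(s + x) ** p
            - np.abs(s) ** p
            - p * np.sign(s) * np.abs(s) ** (p - 1) * x)

# The gap should be within p-dependent constant factors of gamma_p(x; |s|),
# which is the content of Lemma 4.5 in [AKPS19]; we estimate that range here.
p = 1.5
s = np.linspace(0.1, 5.0, 200)          # linearization points (taken positive)
x = np.linspace(-5.0, 5.0, 200)
x = x[np.abs(x) > 1e-9]                 # avoid the trivial 0/0 case
S, Xg = np.meshgrid(s, x)
ratio = linearization_gap(S, Xg, p) / gamma_p(Xg, S, p)
print(ratio.min(), ratio.max())
```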
Lemma 4.
Let . Under the assumptions of Theorem 5, we have
(12)
(13)
As expected, while we do have the desired quadratic upper bound, the lower bound is much more cumbersome, and is only comparable to the second order Taylor expansion when . What we need for the proof to go through is a high probability lower bound of order on the first term in the lower bound (12). We obtain this in the following Proposition.
Proposition 2.
Proof.
Let and let be a truncation parameter we will set later. Define
and the constant . By Lemma 3.3 in [AKPS19], we have that for all . Furthermore, it is straightforward to verify that is decreasing in and increasing in . Therefore, we have, for all ,
(14) |
The key idea to control the infimum in (14) is to truncate from above by using the truncated vector , and from below by forcing it to be greater than . By the monotonicity properties of discussed above, we get that
(15) |
where the equality follows by the fact that with the chosen truncations, the second argument of is less than or equal to the first. It remains to lower bound the infimum in (15). Define
Removing the truncations and using our assumptions, we see that the components of have finite fourth moment. By Proposition 1 and the condition on , we get that with probability at least ,
(16) |
We now bound the supremum in (16). We have
(17) |
where the first inequality follows from the Cauchy-Schwarz inequality, and the subsequent equalities by the definitions of and . It remains to bound the tail probabilities. Recall that , so
where we applied Markov’s inequality in the fourth line, and the last follows by definition of in Theorem 5. Moreover by the finiteness of the second moment of the coordinates of , we have
where , and the last equality by definition of in Theorem 5. By Markov’s inequality
Choosing
ensures that
(18) |
Combining the inequalities (18), (17), (16), (15), and (14) yields the result. ∎
Acknowledgments and Disclosure of Funding
We thank Nikita Zhivotovskiy, Sushant Sachdeva, and Deeksha Adil for feedback on the manuscript. MAE was partially supported by NSERC Grant [2019-06167], CIFAR AI Chairs program, and CIFAR AI Catalyst grant.
References
- [AC11] Jean-Yves Audibert and Olivier Catoni “Robust Linear Least Squares Regression” In The Annals of Statistics, 2011 DOI: 10.1214/11-AOS918
- [AKPS19] Deeksha Adil, Rasmus Kyng, Richard Peng and Sushant Sachdeva “Iterative Refinement for $\ell_p$-norm Regression” In Proceedings of the 2019 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019 DOI: 10.1137/1.9781611975482.86
- [AKPS22] Deeksha Adil, Rasmus Kyng, Richard Peng and Sushant Sachdeva “Fast Algorithms for $\ell_p$-Regression”, 2022 DOI: 10.48550/arXiv.2211.03963
- [APS19] Deeksha Adil, Richard Peng and Sushant Sachdeva “Fast, Provably Convergent IRLS Algorithm for p-norm Linear Regression” In Advances in Neural Information Processing Systems, 2019 URL: https://proceedings.neurips.cc/paper_files/paper/2019/hash/46c7cb50b373877fb2f8d5c4517bb969-Abstract.html
- [Arc96] Miguel A. Arcones “The Bahadur-Kiefer Representation of $L_p$ Regression Estimators” In Econometric Theory, 1996 URL: https://www.jstor.org/stable/3532831
- [AS19] Deeksha Adil and Sushant Sachdeva “Faster $p$-Norm Minimizing Flows, via Smoothed $q$-Norm Problems” In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019 DOI: 10.1137/1.9781611975994.54
- [BCLL18] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee and Yuanzhi Li “An Homotopy Method for $\ell_p$ Regression Provably beyond Self-Concordance and in Input-Sparsity Time” In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018 DOI: 10.1145/3188745.3188776
- [BRW92] Z. D. Bai, C. R. Rao and Y. Wu “M-Estimation of Multivariate Linear Regression Parameters Under a Convex Discrepancy Function” In Statistica Sinica, 1992 URL: https://www.jstor.org/stable/24304129
- [CG17] Olivier Catoni and Ilaria Giulini “Dimension-Free PAC-Bayesian Bounds for Matrices, Vectors, and Linear Least Squares Regression”, 2017 DOI: 10.48550/arXiv.1712.02747
- [HKZ12] Daniel Hsu, Sham M. Kakade and Tong Zhang “Random Design Analysis of Ridge Regression” In Proceedings of the 25th Annual Conference on Learning Theory, 2012 URL: https://proceedings.mlr.press/v23/hsu12.html
- [HS96] Xuming He and Qi-Man Shao “A General Bahadur Representation of M-estimators and Its Application to Linear Regression with Nonstochastic Designs” In The Annals of Statistics, 1996 DOI: 10.1214/aos/1032181172
- [JLS22] Arun Jambulapati, Yang P. Liu and Aaron Sidford “Improved Iteration Complexities for Overconstrained $p$-Norm Regression” In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, 2022 DOI: 10.1145/3519935.3519971
- [KM15] Vladimir Koltchinskii and Shahar Mendelson “Bounding the Smallest Singular Value of a Random Matrix Without Concentration” In International Mathematics Research Notices, 2015 DOI: 10.1093/imrn/rnv096
- [Kni98] Keith Knight “Limiting Distributions for Regression Estimators under General Conditions” In The Annals of Statistics, 1998 DOI: 10.1214/aos/1028144858
- [LC83] Erich L. Lehmann and George Casella “Theory of Point Estimation”, 1983
- [LL05] P. Y. Lai and Stephen M. S. Lee “An Overview of Asymptotic Properties of $L_p$ Regression under General Classes of Error Distributions” In Journal of the American Statistical Association, 2005 URL: https://www.jstor.org/stable/27590567
- [LM16] Guillaume Lecué and Shahar Mendelson “Performance of Empirical Risk Minimization in Linear Aggregation” In Bernoulli, 2016 DOI: 10.3150/15-BEJ701
- [LM17] Guillaume Lecué and Shahar Mendelson “Regularization and the Small-Ball Method II: Complexity Dependent Error Rates” In Journal of Machine Learning Research, 2017 URL: http://jmlr.org/papers/v18/16-422.html
- [LM17a] Guillaume Lecué and Shahar Mendelson “Sparse Recovery under Weak Moment Assumptions” In Journal of the European Mathematical Society, 2017 DOI: 10.4171/jems/682
- [LM18] Guillaume Lecué and Shahar Mendelson “Regularization and the Small-Ball Method I: Sparse Recovery” In The Annals of Statistics, 2018 DOI: 10.1214/17-AOS1562
- [Men14] Shahar Mendelson “Learning without Concentration” In Proceedings of The 27th Conference on Learning Theory, 2014 URL: https://proceedings.mlr.press/v35/mendelson14.html
- [Men18] Shahar Mendelson “Learning without Concentration for General Loss Functions” In Probability Theory and Related Fields, 2018 DOI: 10.1007/s00440-017-0784-y
- [Men21] Shahar Mendelson “Learning Bounded Subsets of $L_p$” In IEEE Transactions on Information Theory, 2021 DOI: 10.1109/TIT.2021.3083553
- [MG22] Jaouad Mourtada and Stéphane Gaïffas “An Improper Estimator with Optimal Excess Risk in Misspecified Density Estimation and Logistic Regression” In Journal of Machine Learning Research, 2022 URL: http://jmlr.org/papers/v23/20-782.html
- [Mou22] Jaouad Mourtada “Exact Minimax Risk for Linear Least Squares, and the Lower Tail of Sample Covariance Matrices” In The Annals of Statistics, 2022 DOI: 10.1214/22-AOS2181
- [MVZ22] Jaouad Mourtada, Tomas Vaškevičius and Nikita Zhivotovskiy “Distribution-free robust linear regression” In Mathematical Statistics and Learning, 2022 DOI: 10.4171/MSL/27
- [Nie92] Wojciech Niemiro “Asymptotics for $M$-Estimators Defined by Convex Minimization” In The Annals of Statistics, 1992 DOI: 10.1214/aos/1176348782
- [OB21] Dmitrii M. Ostrovskii and Francis Bach “Finite-Sample Analysis of $M$-Estimators Using Self-Concordance” In Electronic Journal of Statistics, 2021 DOI: 10.1214/20-EJS1780
- [Oli16] Roberto Imbuzeiro Oliveira “The Lower Tail of Random Quadratic Forms with Applications to Ordinary Least Squares” In Probability Theory and Related Fields, 2016 DOI: 10.1007/s00440-016-0738-9
- [Ron84] Arjen E. Ronner “Asymptotic Normality of $p$-norm Estimators in Multiple Regression” In Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1984 DOI: 10.1007/BF00531893
- [RV15] Mark Rudelson and Roman Vershynin “Small Ball Probabilities for Linear Images of High-Dimensional Distributions” In International Mathematics Research Notices, 2015 DOI: 10.1093/imrn/rnu243
- [Sau18] Adrien Saumard “On Optimality of Empirical Risk Minimization in Linear Aggregation” In Bernoulli, 2018 DOI: 10.3150/17-BEJ925
- [Vaa98] A. W. van der Vaart “Asymptotic Statistics”, 1998 DOI: 10.1017/CBO9780511802256
- [VW96] Aad W. Van Der Vaart and Jon A. Wellner “Weak Convergence and Empirical Processes”, 1996 DOI: 10.1007/978-1-4757-2545-2
- [VZ23] Tomas Vaškevičius and Nikita Zhivotovskiy “Suboptimality of Constrained Least Squares and Improvements via Non-Linear Predictors” In Bernoulli, 2023 DOI: 10.3150/22-BEJ1465
- [Whi82] Halbert White “Maximum Likelihood Estimation of Misspecified Models” In Econometrica, 1982 DOI: 10.2307/1912526
Appendix A Differentiability of the risk
In this section, we rigorously establish the twice differentiability of the risk under our assumptions. We start by showing that under a subset of our assumptions, the risk is differentiable everywhere on $\mathbb{R}^d$.
Lemma 5.
Let and assume that and for all . Then is differentiable on , and
Proof.
Let . We want to show that
where, for convenience, we take the norm to be the Euclidean norm. Define the function and note that by the chain rule is differentiable as a function of on all of . Now let be a sequence in such that . Then
(19) |
Our goal is to interchange the limit and expectation. For that, we will use the dominated convergence theorem. We construct our dominating function as follows. Let , and note that since as . Then we have
where the second line follows by the triangle inequality, the third from the fundamental theorem of calculus applied component-wise, the fourth by the Cauchy-Schwarz inequality, the fifth by Jensen’s inequality and the convexity of the norm, and the eighth by the inequality valid for . It remains to show that is integrable. We have
where in the second line we used that the Euclidean norm is bounded by the -norm, in the third we used Hölder’s inequality, and the last line follows from our assumptions. Applying the dominated convergence theorem, we interchange the limit and the expectation in (19). Recalling that is differentiable finishes the proof. ∎
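As a quick sanity check of this formula (ours; it assumes the unnormalized loss $|\langle w, X \rangle - Y|^p$, so the gradient carries a factor of $p$), one can compare the gradient expression, applied to the empirical measure of a fixed sample, with central finite differences of the corresponding empirical risk.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, d = 1.7, 100000, 4
X = rng.standard_normal((n, d))
Y = X @ np.ones(d) + rng.standard_normal(n)
w = rng.standard_normal(d)

def risk(w):
    return np.mean(np.abs(X @ w - Y) ** p)   # empirical version of R(w)

r = X @ w - Y
grad_formula = p * np.mean((np.sign(r) * np.abs(r) ** (p - 1))[:, None] * X, axis=0)

eps = 1e-4                                    # central finite differences
grad_fd = np.array([(risk(w + eps * e) - risk(w - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.max(np.abs(grad_formula - grad_fd)))  # agrees up to finite-difference error
```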
We now turn to the twice differentiability of the risk. We start with the easy case . The proof is very similar to that of Lemma 5 and we omit it here.
Lemma 6.
Let and assume that and for all . Then is twice differentiable on , and
The case is more complicated. The following lemma establishes the twice differentiability of the risk at its minimizer under a subset of the assumptions of Theorem 5.
Lemma 7.
Let . Assume that and for all . Then is twice differentiable at and
Proof.
The difficulty in the proof compared to Lemma 5 and Lemma 6 stems from the fact that the loss is not twice differentiable at zero. We still rely on the dominated convergence theorem, but the construction of the dominating function is slightly more intricate. Using the setup of the proof of Lemma 5, and following the same line of arguments, we arrive at
(20) |
where we have used the fact that since , is almost surely twice differentiable at . To finish the proof, it remains to construct a dominating function for the above sequence to justify the interchange of the limit and expectation. We consider two cases.
Case 1: . Then we have
where the second line follows by the triangle inequality, the third by the definition of the operator norm, the fourth by valid for , and the fifth and sixth by the Cauchy-Schwarz inequality and the assumed lower bound on .
Case 2: . We start by noting that, for all , we have
Therefore is twice differentiable on . Now
where the second line follows from the triangle inequality, the third follows from the twice differentiability of on and the fundamental theorem of calculus applied component-wise, the fifth by Jensen’s inequality, and the last by definition of and the above lower bound. We therefore define our dominating function by
It is then immediate from our assumptions that is integrable. Interchanging the limit and the expectation in (20) and recalling that is almost surely twice differentiable at finishes the proof.
∎
Appendix B Proof of Lemma 1
We start with the claim that is upper-semicontinuous. We want to show that for any and any sequence converging to (in the norm topology)
Fix a and let be a sequence in satisfying , where for convenience we take to be the Euclidean norm on . Then we have by (reverse) Fatou’s Lemma
(21) |
Now we bound the inner limsup pointwise. We split this task in two cases. If , then
(22) |
Otherwise we have . But then, by the convergence of to , there exists a such that for all we have . This implies that for all
We conclude that
(23) |
Combining (21), (22), and (23) proves the upper-semicontinuity of . Essentially the same proof shows the lower-semicontinuity of for any ; we omit it here.
For the remaining claims, first notice that the function is scale invariant, i.e. for all and all , we have . Therefore
where is the Euclidean unit sphere. By assumption on the random vector , we know that for all . Furthermore since is upper semicontinuous, and is compact, attains its supremum on at some point . From this we conclude that
Finally, we turn to the claim about . Since , the function is a norm on , from which it follows that is also scale invariant for any . Therefore
where is the unit sphere of the norm . Now fix . We claim that for all . Suppose not. Then there exists a such that with probability , but then we get the contradiction
therefore for all . Now since is lower-semicontinuous, and is compact, attains its infimum on at some point . From this we conclude
∎
Appendix C Proof of Theorem 3
Fix , and let . Our goal will be to upper bound the probability . By assumption, we have that , so that for all . Since minimizes the empirical risk, we must also have for all . Let denote the matrix whose -th row is . Then we have the following implications.
(24) |
Let . We claim the following equivalence
(25) |
Indeed, the implication follows by definition of the row rank of . For the implication , by definition, is a spanning set for the row space of , therefore it can be reduced to a basis of it for some indices with . If , then the choice satisfies the right side of (25). Otherwise, let with . Such a subset exists since by assumption . Then the set satisfies the right side of (25). Combining (24) and (25) we arrive at:
(26) |
where the second inequality follows from the union bound. We now bound each of the terms of the sum. Fix with . Let be a non-zero vector orthogonal to . Such a vector must exist since ; see Lemma 8 below for an explicit construction of the function . Denote by the distribution of and the distribution of , where is the distribution of . Note that since is a function of only, it is independent of . In particular, the joint distribution of is given by the product . Now if , then by definition of , . Therefore
(27) |
where in the third line we used the independence of and , in the fourth we used the independence of the , in the sixth we used the definition of in (5), and in the last line we used the fact that and the definition of in Lemma 1. Combining the inequalities (26) and (27) yields the result. ∎
Lemma 8.
Let and let be a sequence of points in . Denote by the matrix whose -th row is and let be its pseudo-inverse. Let be an ordered basis of , and define
Then is non-zero and is orthogonal to .
Proof.
We start by showing that is well defined. First note that is the orthogonal projector onto the kernel of , which is non-trivial since . Now we claim that there exists an such that . Suppose not; then for any , we have , implying that , but this contradicts the non-triviality of . This proves that is well defined, which in turn proves that . It remains to prove the orthogonality claim. Let . Then there exist coefficients such that . Therefore
where the last equality holds since . ∎
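For illustration, here is a numerical transcription of this construction using numpy's pseudo-inverse (ours; the variable names are hypothetical).

```python
import numpy as np

def orthogonal_to_rows(points):
    """Given x_1, ..., x_n in R^d spanning a proper subspace, return a nonzero
    vector orthogonal to all of them, as in Lemma 8: project the standard basis
    onto ker(A) with P = I - A^+ A and return the first nonzero projection."""
    A = np.asarray(points, dtype=float)          # n x d matrix with rows x_i
    d = A.shape[1]
    P = np.eye(d) - np.linalg.pinv(A) @ A        # orthogonal projector onto ker(A)
    for e in np.eye(d):                          # ordered standard basis e_1, ..., e_d
        v = P @ e
        if np.linalg.norm(v) > 1e-10:
            return v
    raise ValueError("rows span R^d: no nonzero orthogonal vector exists")

# Toy usage: 3 points in R^4 (rank <= 3), so a nonzero orthogonal vector exists.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
v = orthogonal_to_rows(A)
print(np.max(np.abs(A @ v)))                     # ~ 0: v is orthogonal to every row
```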
Appendix D Missing proofs for Theorem 1
D.1 Proof of Lemma 2
We compute the expectation:
where in the second line we expanded the inner product of the sums into its terms, used linearity of expectation, and used the independence of the samples to take the expectation inside the inner product. In the last line, we used the fact that the samples are identically distributed to simplify the first term. For the second term, we used the fact that the expectation of the gradient of the loss at the risk minimizer vanishes. Applying Markov’s inequality finishes the proof. ∎
D.2 Proof of Corollary 1
We have
Now by assumption, the components of the vector have finite fourth moment so that applying Proposition 1 and using the condition on yields the result. ∎
Appendix E Detailed proof of Theorem 4
E.1 Proof of Lemma 3
By Lemma 2.5 in [AKPS22], we have for all
Recall that by the chain rule
Replacing and by and respectively, and using the formulas for the gradient and Hessian we arrive at
Averaging over yields the first inequality. The proof of the second inequality proceeds in the same way and uses instead the upper bound of Lemma 2.5 in [AKPS22]. We omit it here. ∎
E.2 Proof of Theorem 4
We proceed similarly to the proof of Theorem 1. By definition of the empirical risk minimizer, we have the upper bound
(28) |
Using (10) from Lemma 3 and the Cauchy-Schwarz inequality, we obtain the pointwise lower bound
(29) |
Using Lemma 2 we have that, with probability at least ,
(30) |
It remains to control from below. Define the random vector
Then, for any , we have
By assumption, the components of the random vector have finite fourth moment. Applying Proposition 1, and using the condition on assumed in the statement of Theorem 4, we get that, with probability at least , for all ,
(31) |
Combining (30) and (31) with (29) gives that with probability at least ,
(32) |
Further combining (32) with (28) and rearranging yields that with probability at least
(33) |
The last step is to bound the excess risk of the empirical risk minimizer using the bound (33) and (11) from Lemma 3. For that, we control the norm term in (11) as follows
(34) |
Appendix F Detailed proof of Theorem 5
F.1 Proof of Lemma 4
Both inequalities follow from Lemma 4.5 in [AKPS19]. (12) follows from a straightforward calculation using the lower bound of Lemma 4.5 in [AKPS19]; we omit it here. The upper bound requires a bit more work. We have by the quoted Lemma
Now assume that . If , we have
Otherwise, if , then we have
Therefore in both cases we have as long as . Replacing and by and respectively we get, on the event that
Recalling that by assumption , taking expectation of both sides, and bounding finishes the proof of (13). ∎
F.2 Proof of Theorem 5
We follow the same proof strategy as the one used in the proofs of Theorems 1 and 4. By definition of the empirical risk minimizer, we have
(35) |
Using (12) from Lemma 4 and the Cauchy-Schwarz inequality, we have the lower bound
(36) |
Using Lemma 2 we have that, with probability at least ,
(37) |
On the other hand, by Proposition 2, we have with probability , for all ,
(38) |
where is as defined in Proposition 2. We now consider two cases. If , then combining (35), (36), (37), and (38) gives
(39) |
Otherwise, , then again combining (35), (36), (37), and (38) gives
(40) |
In either case, we have, using (39) and (40), with probability at least ,
Combining this last inequality with (13) from Lemma 4 finishes the proof. ∎