
Optimal Excess Risk Bounds for Empirical Risk Minimization on $p$-Norm Linear Regression

Ayoub El Hanchi
University of Toronto & Vector Institute
[email protected]

Murat A. Erdogdu
University of Toronto & Vector Institute
[email protected]
Abstract

We study the performance of empirical risk minimization on the $p$-norm linear regression problem for $p\in(1,\infty)$. We show that, in the realizable case, under no moment assumptions, and up to a distribution-dependent constant, $O(d)$ samples are enough to exactly recover the target. Otherwise, for $p\in[2,\infty)$, and under weak moment assumptions on the target and the covariates, we prove a high probability excess risk bound on the empirical risk minimizer whose leading term matches, up to a constant that depends only on $p$, the asymptotically exact rate. We extend this result to the case $p\in(1,2)$ under mild assumptions that guarantee the existence of the Hessian of the risk at its minimizer.

1 Introduction

Real-valued linear prediction is a fundamental problem in machine learning. Traditionally, the square loss has been the default choice for this problem. The performance of empirical risk minimization (ERM) on linear regression under the square loss, as measured by the excess risk, has been studied extensively both from an asymptotic [Whi82, LC83, Vaa98] and a non-asymptotic point of view [AC11, HKZ12, Oli16, LM16, Sau18, Mou22]. An achievement of the last decade has been the development of non-asymptotic excess risk bounds for ERM on this problem under weak assumptions, bounds that match, up to constant factors, the asymptotically exact rate.

In this paper, we consider the more general family of $p$-th power losses $t\mapsto\lvert t\rvert^{p}$ for a user-chosen $p\in(1,\infty)$. Under mild assumptions, the classical asymptotic theory can still be applied to ERM under these losses, yielding the asymptotic distribution of the excess risk. However, to the best of our knowledge, the problem of deriving non-asymptotic excess risk bounds for ERM for $p\in(1,\infty)\setminus\{2\}$ remains open, and, as we discuss below, resists the application of standard tools from the literature.

Our motivation for extending the case $p=2$ to $p\in(1,\infty)$ is twofold. Firstly, the freedom in the choice of $p$ allows us to better capture our prediction goals. For example, we might only care about how accurate our prediction is on average, in which case the choice $p=1$ is appropriate. At the other extreme, we might insist that we do as well as possible on a subset of inputs of probability $1$, in which case the choice $p=\infty$ is best. A choice of $p\in(1,\infty)$ therefore allows us to interpolate between these two extremes, with the case $p=2$ offering a balanced choice. Secondly, different choices of $p$ have complementary qualities. On the one hand, small values of $p$ allow us to operate with weak moment assumptions, making them applicable in more general cases. On the other, larger values of $p$ yield predictions whose optimality is less sensitive to changes in the underlying distribution: for $p=\infty$, the best predictor depends only on the support of this distribution.

To sharpen our discussion, let us briefly formalize our problem. There is an input random vector $X\in\mathbb{R}^{d}$ and an output random variable $Y\in\mathbb{R}$, and we are provided with $n$ i.i.d. samples $(X_{i},Y_{i})_{i=1}^{n}$. We select our set of predictors to be the class of linear functions $\{x\mapsto\langle w,x\rangle\mid w\in\mathbb{R}^{d}\}$, and choose a value $p\in(1,\infty)$ with the corresponding loss $\ell_{p}(t):=\lvert t\rvert^{p}/[p(p-1)]$. Using this loss, we define the associated risk and empirical risk, respectively, by

$$R_{p}(w):=\operatorname{E}[\ell_{p}(\langle w,X\rangle-Y)],\qquad R_{p,n}(w):=\frac{1}{n}\sum_{i=1}^{n}\ell_{p}(\langle w,X_{i}\rangle-Y_{i}).$$

We perform empirical risk minimization $\hat{w}_{p}\in\operatorname*{arg\,min}_{w\in\mathbb{R}^{d}}R_{p,n}(w)$, and our goal is to derive a high probability bound on the excess risk $R_{p}(\hat{w}_{p})-R_{p}(w_{p}^{*})$, where $w_{p}^{*}$ is the risk minimizer. For efficient algorithms for computing an empirical risk minimizer $\hat{w}_{p}$, we refer the reader to the rich recent literature dealing with this problem [BCLL18, AKPS19, APS19, JLS22].
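For concreteness, the following is a minimal sketch of this estimator using a general-purpose solver; the function name erm_pnorm and the warm start at the least squares solution are our choices, and the specialized algorithms cited above should be preferred in practice.

    import numpy as np
    from scipy.optimize import minimize

    def erm_pnorm(X, Y, p):
        """Empirical risk minimizer under the loss l_p(t) = |t|^p / (p(p-1))."""
        n = X.shape[0]
        def risk(w):
            return np.mean(np.abs(X @ w - Y) ** p) / (p * (p - 1))
        def grad(w):
            r = X @ w - Y
            # gradient of the empirical risk: (1/n) sum_i sign(r_i)|r_i|^{p-1} X_i / (p-1)
            return X.T @ (np.sign(r) * np.abs(r) ** (p - 1)) / (n * (p - 1))
        w0 = np.linalg.lstsq(X, Y, rcond=None)[0]  # warm start at the p = 2 solution
        return minimize(risk, w0, jac=grad, method="L-BFGS-B").x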

To see why the problem we are considering is difficult, let us briefly review some of the recent literature. Most closely related to our problem are the results of [AC11, HKZ12, Oli16, LM16], who derive high probability non-asymptotic excess risk bounds for the case $p=2$. The best such bounds are found in [Oli16] and [LM16], who both operate under weak assumptions on $(X,Y)$, requiring at most the existence of fourth moments of $Y$ and the components $X^{j}$ of $X$ for $j\in[d]$. Unfortunately, the analysis in [Oli16] relies on the closed form expression of the empirical risk minimizer $\hat{w}_{2}$, and therefore cannot be extended to other values of $p$. Similarly, the analysis in [LM16] relies on an exact decomposition of the excess loss $\ell_{2}(\langle w,X\rangle-Y)-\ell_{2}(\langle w_{2}^{*},X\rangle-Y)$ in terms of “quadratic” and “multiplier” components, which also does not extend to other values of $p$.

To address these limitations, the work of [Men18] extends the ideas of [Men14, LM16] to work for loss functions more general than the square loss. Roughly speaking, the main result of [Men18] states that as long as the loss is strongly convex and smooth in a neighbourhood of $0$, the techniques developed by [Men14] can still be applied to obtain high probability excess risk bounds. Unfortunately, the loss functions $\ell_{p}(t)$ are particularly ill-behaved in precisely this sense, as $\ell_{p}^{\prime\prime}(t)\to 0$ when $t\to 0$ for $p>2$, and $\lvert\ell_{p}^{\prime\prime}(t)\rvert\to\infty$ as $t\to 0$ for $p\in(1,2)$. This makes the analysis of the excess risk of ERM in the case $p\in(1,\infty)\setminus\{2\}$ particularly challenging using well-established methods.

Contrary to the non-asymptotic regime, the asymptotic properties of the excess risk of ERM under the losses $\ell_{p}(t)$ are better understood [Ron84, BRW92, Nie92, Arc96, HS96, LL05], and can be derived from the more general classical asymptotic theory of $M$-estimators [LC83, VW96, Vaa98] under mild regularity conditions. In particular, these asymptotic results imply that the excess risk of ERM with $n$ samples satisfies

$$\operatorname{E}[R_{p}(\hat{w}_{p})]-R_{p}(w_{p}^{*})=\frac{\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert^{2}_{H_{p}^{-1}}\right]}{2n}+o\left(\frac{1}{n}\right)\quad\text{as}\quad n\to\infty,\tag{1}$$

where $H_{p}:=\nabla^{2}R_{p}(w_{p}^{*})$ is the Hessian of the risk at its minimizer. We refer the reader to the discussions in [OB21, MG22] for more details. As we demonstrate in Theorem 1, the rate of convergence of ERM for the square loss derived in [Oli16] and [LM16] matches the asymptotic rate (1) up to a constant factor. Ideally, we would like our excess risk bounds for the cases $p\in(1,\infty)\setminus\{2\}$ to also match the asymptotic rate (1), although it is not yet clear how to derive any meaningful such bounds.
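To make the leading term of (1) concrete, here is a plug-in sketch that estimates it from data: it forms the empirical Hessian $\frac{1}{n}\sum_{i}\lvert r_{i}\rvert^{p-2}X_{i}X_{i}^{T}$ and the empirical second moment of the loss gradients. Using the ERM $\hat{w}_{p}$ in place of $w_{p}^{*}$, and the function name itself, are our assumptions; this is an illustration, not part of the paper's results.

    import numpy as np

    def asymptotic_leading_term(X, Y, w_hat, p):
        """Plug-in estimate of E[||grad l_p||^2_{H_p^{-1}}] / (2n) from (1)."""
        n = X.shape[0]
        r = X @ w_hat - Y
        # empirical Hessian H_p = E[|r|^{p-2} X X^T]; assumes no zero residuals
        H = (X * (np.abs(r) ** (p - 2))[:, None]).T @ X / n
        # loss gradients: grad l_p = sign(r) |r|^{p-1} X / (p - 1)
        G = X * (np.sign(r) * np.abs(r) ** (p - 1) / (p - 1))[:, None]
        # empirical mean of g_i^T H^{-1} g_i estimates the expectation in (1)
        second_moment = np.mean(np.einsum("ij,ij->i", G @ np.linalg.inv(H), G))
        return second_moment / (2 * n)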

In this paper, we prove the first high probability non-asymptotic excess risk bounds for ERM under the $p$-th power losses $\ell_{p}(t)$ for any $p\in(1,\infty)\setminus\{2\}$. Our assumptions on $(X,Y)$ are weak, arise naturally from the analysis, and reduce to the standard ones for the case $p=2$. Furthermore, the rate we derive matches, up to a constant that depends only on $p$, the asymptotically exact rate (1).

We split the analysis into three cases. The first is when the problem is realizable, i.e. $Y=\langle w^{*},X\rangle$ for some $w^{*}\in\mathbb{R}^{d}$. This edge case is not problematic for the analysis of the case $p=2$, but as discussed above, the $\ell_{p}(t)$ losses are ill-behaved around $0$ for $p\in(1,\infty)\setminus\{2\}$, requiring us to treat this case separately. The second case is when the problem is not realizable and $p\in(2,\infty)$. The final case is when the problem is not realizable and $p\in(1,2)$, which turns out to be the most technically challenging. In Section 2, we present our main results, and in Section 3, we provide their proofs.

Notation. We denote the components of the random vector $X\in\mathbb{R}^{d}$ by $X^{j}$ for $j\in[d]$. We assume the support of $X$ is not contained in any hyperplane, i.e. $\operatorname{P}(\langle w,X\rangle=0)=1$ only if $w=0$. This is without loss of generality, as discussed in [Oli16, Mou22]. For a positive semi-definite matrix $A$, we denote the bilinear form it induces on $\mathbb{R}^{d}$ by $\langle\cdot,\cdot\rangle_{A}$, and define $\lVert\cdot\rVert_{A}:=\sqrt{\langle\cdot,\cdot\rangle_{A}}$. We define $H_{p,n}:=\nabla^{2}R_{p,n}(w_{p}^{*})$.

2 Main results

In this section, we state our main results. We start in Section 2.1, where we introduce constants that help us formulate our theorems. In Section 2.2, we state the best known results for both the case $p=2$ and the realizable case where $Y=\langle w^{*},X\rangle$. Finally, in Section 2.3, we state our theorems.

2.1 Norm equivalence and small ball constants

To state our results, we first need to define two kinds of quantities. The first are norms and their equivalence constants, which we will use in the analysis of the non-realizable case. The second are small ball probabilities, which we will use in the analysis of the realizable case.

We start by introducing the following functions on our space of coefficients $\mathbb{R}^{d}$. For $p,q\in[1,\infty)$, define, with the convention $\infty^{1/p}:=\infty$ for all $p\in[1,\infty)$,

$$\lVert w\rVert_{L^{p}}:=\operatorname{E}[\lvert\langle w,X\rangle\rvert^{p}]^{1/p},\qquad\lVert w\rVert_{L^{q},p}:=\operatorname{E}\left[\lVert w\rVert_{\nabla^{2}\ell_{p}(\langle w^{*}_{p},X\rangle-Y)}^{q}\right]^{1/q}.\tag{2}$$

As suggested by the notation, under appropriate moment assumptions on $X$, these functions are indeed norms on $\mathbb{R}^{d}$. In that case, we will be interested in the norm equivalence constants between them:

$$C_{a\to b}:=\sup_{w\in\mathbb{R}^{d}\setminus\{0\}}\frac{\lVert w\rVert_{a}}{\lVert w\rVert_{b}},\qquad\sigma_{p}^{2}:=C_{(L^{4},p)\to(L^{2},p)}^{4},\tag{3}$$

where $a$ and $b$ stand for one of $L^{p}$ or $(L^{q},p)$. Let us note that since we work in a finite dimensional vector space, all norms are equivalent, so that as soon as the quantities defined in (2) are indeed norms, the constants defined in (3) are finite. Furthermore, as suggested by the notation, $\sigma^{2}_{p}$ may be viewed as the maximum second moment of the random variables $\lVert w\rVert_{\nabla^{2}\ell_{p}(\langle w_{p}^{*},X\rangle-Y)}^{2}$ over the unit sphere of $\lVert\cdot\rVert_{L^{2},p}$. Finally, we record the following identities for future use:

$$\lVert w\rVert_{L^{2},p}=\lVert w\rVert_{H_{p}},\qquad\lVert w\rVert_{L^{q},2}=\lVert w\rVert_{L^{q}},\qquad\sigma^{2}_{2}=C_{L^{4}\to L^{2}}^{4}.\tag{4}$$

The first identity holds by linearity, and the second by noticing that $\nabla^{2}\ell_{2}(\langle w,X\rangle-Y)=XX^{T}$.
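As an illustration of these quantities, the following Monte Carlo sketch lower bounds $\sigma_{p}^{2}$ by maximizing the ratio $\lVert w\rVert_{L^{4},p}^{4}/\lVert w\rVert_{L^{2},p}^{4}$ over random directions. A random search over the sphere only certifies a lower bound on the supremum, so this is purely illustrative, and all names here are ours.

    import numpy as np

    def sigma_p_squared_lower_bound(X, Y, w_star, p, n_dirs=2000, seed=0):
        """Monte Carlo lower bound on sigma_p^2 = sup_w ||w||_{L^4,p}^4 / ||w||_{L^2,p}^4."""
        rng = np.random.default_rng(seed)
        weights = np.abs(X @ w_star - Y) ** (p - 2)            # Hessian weights |r|^{p-2}
        W = rng.standard_normal((n_dirs, X.shape[1]))          # random directions w
        proj = X @ W.T                                         # <w, X_i>, shape (n, n_dirs)
        m2 = (weights[:, None] * proj ** 2).mean(axis=0)       # ||w||_{L^2,p}^2
        m4 = (weights[:, None] ** 2 * proj ** 4).mean(axis=0)  # ||w||_{L^4,p}^4
        return float(np.max(m4 / m2 ** 2))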

We now turn to small ball probabilities. We define the following functions on $\mathbb{R}^{d}$, for $q\in[1,\infty)$,

$$\rho_{0}(w):=\operatorname{P}(\langle w,X\rangle=0),\qquad\rho_{q}(w,\kappa):=\operatorname{P}(\lvert\langle w,X\rangle\rvert>\kappa\lVert w\rVert_{L^{q}}).\tag{5}$$

Assumptions on the functions $\rho_{0}$ and $\rho_{2}$ have been used extensively in the recent literature, see e.g. [Men14, KM15, LM17, LM17a, Men18, LM18, Mou22]. In particular, a standard assumption postulates the existence of strictly positive constants $\beta_{0}$ and $(\beta_{2},\kappa_{2})$ such that $\rho_{0}(w)\leq 1-\beta_{0}$ and $\rho_{2}(w,\kappa_{2})\geq\beta_{2}$ for all $w\in\mathbb{R}^{d}$. Conditions of this type are usually referred to as small ball conditions. Efforts have been made to understand when these conditions hold [Men14, RV15, LM17a], as well as to reveal the dimension dependence of the constants with which they do [Sau18]. Here we prove that such conditions always hold for finite dimensional spaces. We leave the proof of Lemma 1 to Appendix B so as not to distract from our main development.

Lemma 1.

$\rho_{0}$ is upper semi-continuous. Furthermore, if for some $q\in[1,\infty)$, $\operatorname{E}[\lvert X^{j}\rvert^{q}]<\infty$ for all $j\in[d]$, then $\rho_{q}(\cdot,\kappa)$ is lower semi-continuous for any $\kappa\geq 0$. Moreover, for all $\kappa\in[0,1)$,

$$\rho:=\sup_{w\in\mathbb{R}^{d}\setminus\{0\}}\rho_{0}(w)<1,\qquad\inf_{w\in\mathbb{R}^{d}\setminus\{0\}}\rho_{q}(w,\kappa)>0.$$
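As a sanity check on these quantities, their empirical analogues can be estimated over random directions. A finite set of directions only bounds $\rho$ from below and the infimum from above, so the sketch below is a heuristic, and its names are ours.

    import numpy as np

    def small_ball_estimates(X, q=2.0, kappa=0.5, n_dirs=2000, seed=0):
        """Empirical rho_0(w) and rho_q(w, kappa) of (5) over random directions w."""
        rng = np.random.default_rng(seed)
        W = rng.standard_normal((n_dirs, X.shape[1]))
        proj = np.abs(X @ W.T)                          # |<w, X_i>|, shape (n, n_dirs)
        rho_0 = (proj == 0).mean(axis=0)                # P(<w, X> = 0), per direction
        Lq = (proj ** q).mean(axis=0) ** (1.0 / q)      # empirical ||w||_{L^q}
        rho_q = (proj > kappa * Lq).mean(axis=0)        # P(|<w, X>| > kappa ||w||_{L^q})
        return float(rho_0.max()), float(rho_q.min())   # estimates of rho and the inf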

2.2 Background

To better contextualize our results, we start by stating the best known high probability bound on ERM for the square loss, which we deduce from [Oli16] and [LM16].

Theorem 1 (Theorem 4.2, [Oli16]; Theorem 1.3, [LM16]).

Assume that $\operatorname{E}[Y^{2}]<\infty$ and $\operatorname{E}[(X^{j})^{4}]<\infty$ for all $j\in[d]$, and let $\delta\in(0,1]$. If

$$n\geq 196\sigma^{2}_{2}\left(d+2\log(4/\delta)\right),$$

then, with probability at least $1-\delta$,

$$R_{2}(\hat{w}_{2})-R_{2}(w_{2}^{*})\leq\frac{16\operatorname{E}[\lVert\nabla\ell_{2}(\langle w^{*}_{2},X\rangle-Y)\rVert^{2}_{H_{2}^{-1}}]}{n\delta}.$$

Up to a constant factor and the dependence on $\delta$, Theorem 1 recovers the asymptotically exact rate (1). Let us briefly comment on the differences between Theorem 1 and the comparable statements in the original papers. First, the finiteness of $\sigma_{2}^{2}$ is deduced from the finiteness of the fourth moments of the components of $X$, instead of being assumed as in [Oli16] (see the discussion in Section 3.1 in [Oli16]). Second, we combine Theorem 3.1 from [Oli16] with the proof technique of [LM16] to achieve a slightly better bound than the one achieved by the proof technique used in the proof of Theorem 4.2 in [Oli16], while avoiding the dependence on the small ball constant present in the bound of Theorem 1.3 in [LM16], which is known to incur additional dimension dependence in some cases [Sau18].

We now move to the realizable case, where $Y=\langle w^{*},X\rangle$ so that $w_{p}^{*}=w^{*}$ for all $p\in(1,\infty)$. We immediately note that Theorem 1 is still applicable in this case, and ensures that we recover $w^{*}$ exactly with no more than $n=O(\sigma_{2}^{2}d)$ samples. However, we can do much better, while getting rid of all the moment assumptions in Theorem 1. Indeed, it is not hard to see that $\hat{w}_{p}\neq w^{*}$ only if for some $w\in\mathbb{R}^{d}\setminus\{0\}$, $\langle w,X_{i}\rangle=0$ for all $i\in[n]$ (taking $w=\hat{w}_{p}-w_{p}^{*}$ works). The implicit argument in Theorem 1 then uses the pointwise bound (see Lemma B.2 in [Oli16])

$$\operatorname{P}\left(\cap_{i=1}^{n}\{\langle w,X_{i}\rangle=0\}\right)\leq\exp\left(-\frac{n}{2\sigma^{2}_{2}}\right),$$

and uniformizes it over the $L^{2}$ unit sphere in $\mathbb{R}^{d}$, where the $L^{2}$ norm is as defined in (2). However, we can use the much tighter bound $\rho^{n}$, where $\rho$ is as defined in Lemma 1. To the best of our knowledge, the realizable case has not been studied explicitly before in the literature. However, with the above considerations in mind, we can deduce the following result from [LM17a], which uniformizes the pointwise bound we just discussed using a VC dimension argument.

Theorem 2 (Corollary 2.5, [LM17a]).

Assume that there exists $w^{*}\in\mathbb{R}^{d}$ such that $Y=\langle w^{*},X\rangle$. Let $\delta\in(0,1]$. If

$$n\geq O\left(\frac{d+\log(1/\delta)}{(1-\rho)^{2}}\right),$$

then for any $p\in(1,\infty)$, $\hat{w}_{p}=w^{*}$ with probability at least $1-\delta$.

2.3 Results

We are now in a position to state our main results. As discussed in Section 1, the $\ell_{p}(t)$ losses have degenerate second derivatives as $t\to 0$. When the problem is realizable, the risk is not twice differentiable at its minimizer for the cases $p\in(1,2)$, and is degenerate for the cases $p\in(2,\infty)$. If we want bounds of the form (1), we must exclude this case from our analysis. Our first main result is a strengthening of Theorem 2, and relies on a combinatorial argument to uniformize the pointwise estimate discussed in Section 2.2.

Theorem 3.

Assume that there exists $w^{*}\in\mathbb{R}^{d}$ such that $\langle w^{*},X\rangle=Y$. Then for all $n\geq d$, and for all $p\in(1,\infty)$, we have

$$\operatorname{P}\left(\hat{w}_{p}\neq w^{*}\right)\leq\binom{n}{d-1}\rho^{n-d+1},$$

where ρ\rho is as defined in Lemma 1. Furthermore, if

$$n\geq\begin{cases}O\left(d+\log(1/\delta)/\log(1/\rho)\right)&\text{if }0\leq\rho<e^{-1},\\[4pt] O\left(\dfrac{d+\log(1/\delta)}{1-\rho}\right)&\text{if }e^{-1}\leq\rho<e^{-1/e},\\[4pt] O\left(\dfrac{d\log(1/(1-\rho))+\log(1/\delta)}{1-\rho}\right)&\text{if }e^{-1/e}\leq\rho<1,\end{cases}$$

then with probability at least $1-\delta$, $\hat{w}_{p}=w^{*}$.

Comparing Theorem 2 and Theorem 3, we see that the bound on the number of samples required to reach a confidence level $\delta$ in Theorem 3 is uniformly smaller than the one in Theorem 2. The proof of Theorem 3 can be found in Appendix C.

We now move to the more common non-realizable case. Our first theorem here gives a non-asymptotic bound for the excess risk of ERM under a $p$-th power loss for $p\in(2,\infty)$. To the best of our knowledge, no such result is known in the literature.

Theorem 4.

Let $p\in(2,\infty)$ and $\delta\in(0,1]$. Assume that no $w\in\mathbb{R}^{d}$ satisfies $Y=\langle w,X\rangle$. Further, assume that $\operatorname{E}[\lvert Y\rvert^{p}]<\infty$, $\operatorname{E}[\lvert X^{j}\rvert^{p}]<\infty$, and $\operatorname{E}[\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{2(p-2)}(X^{j})^{4}]<\infty$ for all $j\in[d]$. If

$$n\geq 196\sigma^{2}_{p}\left(d+2\log(4/\delta)\right),$$

then, with probability at least $1-\delta$,

$$R_{p}(\hat{w}_{p})-R_{p}(w^{*}_{p})\leq\frac{2048p^{2}\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]}{n\delta}+\left(\frac{512p^{4}c_{p}^{2}\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]}{n\delta}\right)^{p/2},$$

where we used $c_{p}$ to denote $C_{L^{p}\to(L^{2},p)}$ as defined in (3).

Up to a constant factor that depends only on $p$ and the dependence on $\delta$, the bound of Theorem 4 is precisely of the form of the optimal bound (1). Indeed, as $p>2$, the second term is $o(1/n)$. At the level of assumptions, the finiteness of the $p$-th moment of $Y$ and of the components of $X$ is necessary to ensure that the risk $R_{p}$ is finite for all $w\in\mathbb{R}^{d}$. The last assumption, $\operatorname{E}[\lvert Y-\langle w_{p}^{*},X\rangle\rvert^{2(p-2)}(X^{j})^{4}]<\infty$, is a natural extension of the fourth moment assumption in Theorem 1. In fact, all three assumptions in Theorem 4 reduce to those of Theorem 1 as $p\to 2$. It is worth noting that the constant $c_{p}$ has the alternative expression $\sup_{w\in\mathbb{R}^{d}\setminus\{0\}}\{\lVert w\rVert_{L^{p}}/\lVert w\rVert_{H_{p}}\}$ by (4), i.e. it is the norm equivalence constant between the $L^{p}$ norm and the norm induced by $H_{p}$. Using (4) again, we see that $c_{p}\to 1$ as $p\to 2$. As $p\to\infty$, $c_{p}$ grows, and we suspect it does so in a dimension dependent way. However, this does not affect the asymptotic optimality of our rate, as $c_{p}$ only enters an $o(1/n)$ term in our bound.

We now turn to the case of $p\in(1,2)$, where we need a slightly stronger version of non-realizability.

Theorem 5.

Let $p\in(1,2)$ and $\delta\in(0,1]$. Assume that $\operatorname{P}(\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{2-p}>0)=1$ and $\operatorname{E}[\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{2(p-2)}]<\infty$. Further, assume that $\operatorname{E}[\lvert Y\rvert^{p}]<\infty$, $\operatorname{E}[(X^{j})^{2}]<\infty$, and $\operatorname{E}[\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{2(p-2)}(X^{j})^{4}]<\infty$ for all $j\in[d]$. If

$$n\geq 196\sigma_{p}^{2}\left(d+2\log(4/\delta)\right),$$

then, with probability at least $1-\delta$,

$$R_{p}(\hat{w}_{p})-R_{p}(w_{p}^{*})\leq\frac{32768}{p-1}\frac{\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]}{n\delta}+\frac{1}{p-1}\left(\frac{2097152\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]\sigma_{p}^{6-2p}c_{p}^{2-p}c^{*}_{p}}{n\delta}\right)^{1/(p-1)},$$

where we used $c^{*}_{p}$ to denote $\operatorname{E}[\lvert Y-\langle w_{p}^{*},X\rangle\rvert^{2(p-2)}]$ and $c_{p}:=\operatorname{Tr}(H_{p}^{-1}\Sigma)$, where $\Sigma:=\operatorname{E}[XX^{T}]$.

Just like the bounds of Theorems 1 and 4, the bound of Theorem 5 is asymptotically optimal up to a constant factor that depends only on $p$. Indeed, since $1<p<2$, we have $1/(p-1)>1$, and the second term is $o(1/n)$. At the level of assumptions, we have two additional conditions compared to Theorem 4. First, we require the existence of the second moment of the covariates instead of just the $p$-th moment. Second, we require a stronger version of non-realizability by assuming the existence of the $2(2-p)$ negative moment of $\lvert\langle w_{p}^{*},X\rangle-Y\rvert$. In the majority of applications, an intercept variable is included as a covariate, i.e. $X^{1}=1$, so that this negative moment assumption is already implied by the standard assumption $\operatorname{E}[\lvert Y-\langle w_{p}^{*},X\rangle\rvert^{2(p-2)}(X^{j})^{4}]<\infty$. In the rare case where an intercept variable is not included, any negative moment assumption on $\lvert\langle w_{p}^{*},X\rangle-Y\rvert$ can be used instead, at the cost of a larger factor in the $o(1/n)$ term.

Finally, it is worth noting that for the cases $p\in[1,2)$, there are situations where the asymptotic bound (1) does not hold, as the limiting distribution of the coefficients $\hat{w}_{p}$ as $n\to\infty$ is not necessarily Gaussian, and depends heavily on the distribution of $\langle w_{p}^{*},X\rangle-Y$; see e.g. [LL05] and [Kni98]. Overall, we suspect that a slightly weaker version of our assumptions may be necessary for a fast rate like (1) to hold.

3 Proofs

3.1 Proof of Theorem 1

Here we give a detailed proof of Theorem 1. While the core technical result can be deduced by combining results from [Oli16] and [LM16], we frame the proof in a way that makes it easy to extend to the cases $p\in(1,\infty)$, and that differs from either paper. We split the proof into three steps. First, notice that since the loss is a quadratic function of $w$, we can express it exactly using a second order Taylor expansion around the minimizer $w_{2}^{*}$:

$$\ell_{2}(\langle w,X\rangle-Y)-\ell_{2}(\langle w^{*}_{2},X\rangle-Y)=\langle\nabla\ell_{2}(\langle w^{*}_{2},X\rangle-Y),w-w_{2}^{*}\rangle+\frac{1}{2}\lVert w-w_{2}^{*}\rVert_{\nabla^{2}\ell_{2}(\langle w^{*}_{2},X\rangle-Y)}^{2}.$$

Taking empirical averages and expectations of both sides respectively shows that the excess empirical risk and excess risk also admit such an expansion

$$R_{2,n}(w)-R_{2,n}(w^{*}_{2})=\langle\nabla R_{2,n}(w_{2}^{*}),w-w_{2}^{*}\rangle+\frac{1}{2}\lVert w-w_{2}^{*}\rVert_{H_{2,n}}^{2},$$
$$R_{2}(w)-R_{2}(w^{*}_{2})=\frac{1}{2}\lVert w-w_{2}^{*}\rVert_{H_{2}}^{2},\tag{6}$$

where in the second equality we used that the gradient of the risk vanishes at the minimizer $w_{2}^{*}$. Therefore, to bound the excess risk, it is sufficient to bound the norm $\lVert w-w_{2}^{*}\rVert_{H_{2}}$. This is the goal of the second step, where we use two ideas. First, by definition, the excess empirical risk of the empirical risk minimizer satisfies the upper bound

$$R_{2,n}(\hat{w}_{2})-R_{2,n}(w_{2}^{*})\leq 0.\tag{7}$$

Second, we use the Cauchy-Schwarz inequality to lower bound the excess empirical risk by

$$R_{2,n}(\hat{w}_{2})-R_{2,n}(w_{2}^{*})\geq-\lVert\nabla R_{2,n}(w_{2}^{*})\rVert_{H_{2}^{-1}}\lVert\hat{w}_{2}-w_{2}^{*}\rVert_{H_{2}}+\frac{1}{2}\lVert\hat{w}_{2}-w_{2}^{*}\rVert_{H_{2,n}}^{2},\tag{8}$$

and we further lower bound it by deriving high probability bounds on the two random terms $\lVert\nabla R_{2,n}(w_{2}^{*})\rVert_{H_{2}^{-1}}$ and $\lVert\hat{w}_{2}-w_{2}^{*}\rVert_{H_{2,n}}^{2}$. The first can easily be bounded using Chebyshev's inequality and the elementary fact that the variance of the average of $n$ i.i.d. random variables is the variance of their common distribution divided by $n$. Here we state the result for all $p\in(1,\infty)$; the straightforward proof can be found in Appendix D.

Lemma 2.

Let $p\in(1,\infty)$. If $p\in(1,2)$, let the assumptions of Theorem 5 hold. Then with probability at least $1-\delta/2$,

$$\lVert\nabla R_{p,n}(w_{p}^{*})\rVert_{H_{p}^{-1}}\leq\sqrt{2\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]/(n\delta)}.$$
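The argument behind Lemma 2 is short enough to sketch here (this is our reconstruction; the full proof is in Appendix D). Since $w_{p}^{*}$ minimizes $R_{p}$, the gradient $\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)$ has mean zero, so $\nabla R_{p,n}(w_{p}^{*})$ is an average of $n$ i.i.d. mean-zero vectors, and

$$\operatorname{E}\left[\lVert\nabla R_{p,n}(w_{p}^{*})\rVert_{H_{p}^{-1}}^{2}\right]=\frac{1}{n}\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right].$$

Applying Markov's inequality to $\lVert\nabla R_{p,n}(w_{p}^{*})\rVert_{H_{p}^{-1}}^{2}$ at level $2\operatorname{E}[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}]/(n\delta)$ then gives the stated bound with failure probability $\delta/2$.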

For the second random term, $\lVert\hat{w}_{2}-w_{2}^{*}\rVert_{H_{2,n}}^{2}$, we use Theorem 3.1 of [Oli16], which we restate here, emphasizing that the existence of fourth moments of the components of the random vector is enough to ensure the existence of the needed norm equivalence constant.

Proposition 1 (Theorem 3.1, [Oli16]).

Let $Z\in\mathbb{R}^{d}$ be a random vector satisfying $\operatorname{E}[Z_{j}^{4}]<\infty$ for all $j\in[d]$, and assume that $\operatorname{P}(\langle v,Z\rangle=0)=1$ only if $v=0$. For $p\in[1,\infty)$ and $v\in\mathbb{R}^{d}$, define

$$\lVert v\rVert_{L^{p}}:=\operatorname{E}[(\langle v,Z\rangle)^{p}]^{1/p},\qquad\sigma^{2}:=\left(\sup_{v\in\mathbb{R}^{d}\setminus\{0\}}\lVert v\rVert_{L^{4}}/\lVert v\rVert_{L^{2}}\right)^{4}.$$

Let $(Z_{i})_{i=1}^{n}$ be i.i.d. samples of $Z$. Then, with probability at least $1-\delta$, for all $v\in\mathbb{R}^{d}$,

$$\frac{1}{n}\sum_{i=1}^{n}\langle v,Z_{i}\rangle^{2}\geq\left(1-7\sigma\sqrt{\frac{d+2\log(2/\delta)}{n}}\right)\lVert v\rVert_{L^{2}}^{2}.$$

Using this result, we can immediately deduce the required high probability lower bound on the second random term $\lVert\hat{w}_{2}-w^{*}_{2}\rVert_{H_{2,n}}^{2}$; we leave the straightforward proof to Appendix D.

Corollary 1.

Under the assumptions of Theorem 1, if $n\geq 196\sigma_{2}^{2}(d+2\log(4/\delta))$, then with probability at least $1-\delta/2$, for all $w\in\mathbb{R}^{d}$,

$$\lVert w-w_{2}^{*}\rVert_{H_{2,n}}^{2}\geq\frac{1}{2}\lVert w-w_{2}^{*}\rVert_{H_{2}}^{2}.$$

Combining Lemma 2, Corollary 1, and (8) yields that with probability at least $1-\delta$,

$$R_{2,n}(\hat{w}_{2})-R_{2,n}(w_{2}^{*})\geq-\sqrt{2\operatorname{E}\left[\lVert\nabla\ell_{2}(\langle w_{2}^{*},X\rangle-Y)\rVert_{H_{2}^{-1}}^{2}\right]/(n\delta)}\,\lVert\hat{w}_{2}-w_{2}^{*}\rVert_{H_{2}}+\frac{1}{4}\lVert\hat{w}_{2}-w_{2}^{*}\rVert_{H_{2}}^{2}.\tag{9}$$

Finally, combining (7) and (9) gives that with probability at least $1-\delta$,

$$\lVert\hat{w}_{2}-w_{2}^{*}\rVert_{H_{2}}\leq 4\sqrt{2\operatorname{E}\left[\lVert\nabla\ell_{2}(\langle w_{2}^{*},X\rangle-Y)\rVert_{H_{2}^{-1}}^{2}\right]/(n\delta)}.$$

Substituting into (6) finishes the proof. ∎

3.2 Proof Sketch of Theorem 4

The main challenge in moving from the case $p=2$ to the case $p\in(2,\infty)$ is that the second order Taylor expansion of the loss is no longer exact. The standard way to deal with this problem is to assume that the loss is upper and lower bounded by quadratic functions, i.e. that it is smooth and strongly convex. Unfortunately, as discussed in Section 1, the $\ell_{p}$ loss is not strongly convex for any $p>2$, so we need to find another way to deal with this issue. Once this has been resolved, however, the strategy we used in the proof of Theorem 1 can be applied almost verbatim to yield the result. Remarkably, a result of [AKPS22] allows us to upper and lower bound the $p$-th power loss for $p\in(2,\infty)$ by its second order Taylor expansion around a point, up to some residual terms. An application of this result yields the following lemma.

Lemma 3.

Let $p\in(2,\infty)$. Then:

$$R_{p,n}(w)-R_{p,n}(w_{p}^{*})\geq\frac{1}{8(p-1)}\lVert w-w_{p}^{*}\rVert_{H_{p,n}}^{2}+\langle\nabla R_{p,n}(w_{p}^{*}),w-w_{p}^{*}\rangle,\tag{10}$$
$$R_{p}(w)-R_{p}(w_{p}^{*})\leq\frac{2p}{p-1}\lVert w-w_{p}^{*}\rVert_{H_{p}}^{2}+p^{p}\lVert w-w_{p}^{*}\rVert_{L^{p}}^{p}.\tag{11}$$

Up to constant factors that depend only on $p$ and an $L^{p}$ norm residual term, Lemma 3 gives matching upper and lower bounds on the excess risk and excess empirical risk in terms of their second order Taylor expansions around the minimizer. We can thus use the approach taken in the proof of Theorem 1 to obtain the result. The only additional challenge is the control of the term $\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{L^{p}}$, which we achieve by reducing it to a $\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}$ term using norm equivalence. A detailed proof of Theorem 4, including the proof of Lemma 3, can be found in Appendix E.

3.3 Proof Sketch of Theorem 5

The most technically challenging case is $p\in(1,2)$. Indeed, as seen in the proof of Theorem 1, the most involved step is lower bounding the excess empirical risk with high probability. For the case $p\in[2,\infty)$, we achieved this by having access to a pointwise quadratic lower bound, which is not too surprising. Indeed, at small scales, we expect the second order Taylor expansion to be accurate, while at large scales, we expect the $p$-th power loss to grow at least quadratically for $p\in[2,\infty)$.

In the case of $p\in(1,2)$, we are faced with a harder problem. Indeed, as $p\to 1$, the $\ell_{p}$ losses behave almost linearly at large scales. This means that we cannot expect to obtain a global quadratic lower bound as for the case $p\in[2,\infty)$, so we will need a different proof technique. Motivated by related concerns, [BCLL18] introduced the following approximation to the $p$-th power function:

$$\gamma_{p}(t,x):=\begin{cases}\dfrac{p}{2}t^{p-2}x^{2}&\text{if }x\leq t,\\[4pt] x^{p}-\left(1-\dfrac{p}{2}\right)t^{p}&\text{if }x>t,\end{cases}$$

for $t,x\in[0,\infty)$ and with $\gamma_{p}(0,0)=0$. This function was further studied by [AKPS19], who showed in particular that for any $t\in\mathbb{R}$, the function $x\mapsto\gamma_{p}(\lvert t\rvert,\lvert x\rvert)$ is, up to constants that depend only on $p$, equal to the gap between the function $x\mapsto\ell_{p}(t+x)$ and its linearization around $0$; see Lemma 4.5 in [AKPS19] for the precise statement. We use this result to derive the following lemma.
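For reference, a direct transcription of $\gamma_{p}$ (a sketch for numerical experimentation, vectorized over arrays; the handling of the convention $\gamma_{p}(0,0)=0$ is ours):

    import numpy as np

    def gamma_p(t, x, p):
        """gamma_p(t, x) of [BCLL18] for t, x >= 0, elementwise on arrays."""
        t = np.asarray(t, dtype=float)
        x = np.asarray(x, dtype=float)
        with np.errstate(divide="ignore", invalid="ignore"):
            quad = 0.5 * p * t ** (p - 2) * x ** 2       # branch for x <= t
        tail = x ** p - (1 - 0.5 * p) * t ** p           # branch for x > t
        out = np.where(x <= t, quad, tail)
        return np.where((t == 0) & (x == 0), 0.0, out)   # convention gamma_p(0, 0) = 0

The scaling property $\gamma_{p}(t,\lambda x)\geq\min\{\lambda^{2},\lambda^{p}\}\gamma_{p}(t,x)$ used in the proof below can be spot-checked numerically with this function.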

Lemma 4.

Let $p\in(1,2)$. Under the assumptions of Theorem 5, we have

$$R_{p,n}(w)-R_{p,n}(w_{p}^{*})\geq\frac{1}{4p^{2}}\frac{1}{n}\sum_{i=1}^{n}\gamma_{p}\left(\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\lvert\langle w-w_{p}^{*},X_{i}\rangle\rvert\right)+\langle\nabla R_{p,n}(w_{p}^{*}),w-w_{p}^{*}\rangle,\tag{12}$$
$$R_{p}(w)-R_{p}(w_{p}^{*})\leq\frac{4}{p-1}\lVert w-w_{p}^{*}\rVert_{H_{p}}^{2}.\tag{13}$$

As expected, while we do have the desired quadratic upper bound, the lower bound is much more cumbersome, and is only comparable to the second order Taylor expansion when $\lvert\langle w-w_{p}^{*},X_{i}\rangle\rvert\leq\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert$. What we need for the proof to go through is a high probability lower bound of order $\Omega(\lVert w-w_{p}^{*}\rVert_{H_{p}}^{2})$ on the first term in the lower bound (12). We obtain this in the following proposition.

Proposition 2.

Let $\delta\in(0,1]$. Under the assumptions of Theorem 5, if $n\geq 196\sigma_{p}^{2}(d+2\log(4/\delta))$, then with probability at least $1-\delta/2$, for all $w\in\mathbb{R}^{d}$,

$$\frac{1}{n}\sum_{i=1}^{n}\gamma_{p}\left(\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\lvert\langle w-w_{p}^{*},X_{i}\rangle\rvert\right)\geq\frac{p}{8}\min\left\{\lVert w-w_{p}^{*}\rVert_{H_{p}}^{2},\varepsilon^{2-p}\lVert w-w_{p}^{*}\rVert_{H_{p}}^{p}\right\},$$

where $\varepsilon^{p-2}:=8\sigma_{p}^{3-p}c_{p}^{(2-p)/2}\sqrt{c^{*}_{p}}$, and $c_{p}$ and $c^{*}_{p}$ are as defined in Theorem 5.

Proof.

Let $\varepsilon>0$ and let $T\in(0,\infty)$ be a truncation parameter we will set later. Define

$$\tilde{X}:=X\cdot\mathbbm{1}_{[0,T]}(\lVert X\rVert_{H_{p}^{-1}}),$$

and the constant $\beta:=T\varepsilon$. By Lemma 3.3 in [AKPS19], we have that $\gamma_{p}(t,\lambda x)\geq\min\{\lambda^{2},\lambda^{p}\}\gamma_{p}(t,x)$ for all $\lambda\geq 0$. Furthermore, it is straightforward to verify that $\gamma_{p}(t,x)$ is decreasing in $t$ and increasing in $x$. Therefore, we have, for all $w\in\mathbb{R}^{d}$,

$$\begin{aligned}&\frac{1}{n}\sum_{i=1}^{n}\gamma_{p}\left(\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\lvert\langle w-w_{p}^{*},X_{i}\rangle\rvert\right)\\
&\geq\min\left\{\varepsilon^{-2}\lVert w-w_{p}^{*}\rVert_{H_{p}}^{2},\varepsilon^{-p}\lVert w-w_{p}^{*}\rVert_{H_{p}}^{p}\right\}\frac{1}{n}\sum_{i=1}^{n}\gamma_{p}\left(\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\left\lvert\left\langle\frac{\varepsilon(w-w_{p}^{*})}{\lVert w-w_{p}^{*}\rVert_{H_{p}}},X_{i}\right\rangle\right\rvert\right)\\
&\geq\min\left\{\varepsilon^{-2}\lVert w-w_{p}^{*}\rVert_{H_{p}}^{2},\varepsilon^{-p}\lVert w-w_{p}^{*}\rVert_{H_{p}}^{p}\right\}\cdot\inf_{\lVert w\rVert_{H_{p}}=\varepsilon}\frac{1}{n}\sum_{i=1}^{n}\gamma_{p}\left(\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\lvert\langle w,X_{i}\rangle\rvert\right).\end{aligned}\tag{14}$$

The key idea to control the infimum in (14) is to truncate $\langle w,X_{i}\rangle$ from above by using the truncated vector $\tilde{X}$, and $\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert$ from below by forcing it to be greater than $\beta$. By the monotonicity properties of $\gamma_{p}$ discussed above, we get that

$$\begin{aligned}\inf_{\lVert w\rVert_{H_{p}}=\varepsilon}\frac{1}{n}\sum_{i=1}^{n}\gamma_{p}\left(\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\lvert\langle w,X_{i}\rangle\rvert\right)&\geq\inf_{\lVert w\rVert_{H_{p}}=\varepsilon}\frac{1}{n}\sum_{i=1}^{n}\gamma_{p}\left(\max\left\{\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\beta\right\},\lvert\langle w,\tilde{X}_{i}\rangle\rvert\right)\\
&=\frac{\varepsilon^{2}p}{2}\inf_{\lVert w\rVert_{H_{p}}=1}\frac{1}{n}\sum_{i=1}^{n}\max\left\{\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\beta\right\}^{p-2}\lvert\langle w,\tilde{X}_{i}\rangle\rvert^{2},\end{aligned}\tag{15}$$

where the equality follows from the fact that, with the chosen truncations, the second argument of $\gamma_{p}$ is less than or equal to the first. It remains to lower bound the infimum in (15). Define

$$Z=\max\left\{\lvert\langle w_{p}^{*},X\rangle-Y\rvert,\beta\right\}^{(p-2)/2}\tilde{X}.$$

Removing the truncations and using our assumptions, we see that the components of $Z$ have finite fourth moment. By Proposition 1 and the condition on $n$, we get that with probability at least $1-\delta/2$,

$$\begin{aligned}&\inf_{\lVert w\rVert_{H_{p}}=1}\frac{1}{n}\sum_{i=1}^{n}\max\left\{\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\beta\right\}^{p-2}\lvert\langle w,\tilde{X}_{i}\rangle\rvert^{2}\\
&=\inf_{\lVert w\rVert_{H_{p}}=1}\frac{1}{n}\sum_{i=1}^{n}\langle w,Z_{i}\rangle^{2}\geq\frac{1}{2}\inf_{\lVert w\rVert_{H_{p}}=1}\operatorname{E}\left[\langle w,Z\rangle^{2}\right]\\
&=\frac{1}{2}\inf_{\lVert w\rVert_{H_{p}}=1}\operatorname{E}\left[\max\left\{\lvert\langle w_{p}^{*},X\rangle-Y\rvert,\beta\right\}^{p-2}\langle w,\tilde{X}\rangle^{2}\right]\\
&\geq\frac{1}{2}\left(1-\sup_{\lVert w\rVert_{H_{p}}=1}\operatorname{E}\left[\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{p-2}\langle w,X\rangle^{2}\left(\mathbbm{1}_{[0,\beta)}(\lvert\langle w_{p}^{*},X\rangle-Y\rvert)+\mathbbm{1}_{(T,\infty)}(\lVert X\rVert_{H_{p}^{-1}})\right)\right]\right).\end{aligned}\tag{16}$$

We now bound the supremum in (16). We have

$$\begin{aligned}&\sup_{\lVert w\rVert_{H_{p}}=1}\operatorname{E}\left[\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{p-2}\langle w,X\rangle^{2}\left(\mathbbm{1}_{[0,\beta)}(\lvert\langle w_{p}^{*},X\rangle-Y\rvert)+\mathbbm{1}_{(T,\infty)}(\lVert X\rVert_{H_{p}^{-1}})\right)\right]\\
&\leq\sup_{\lVert w\rVert_{H_{p}}=1}\left\{\operatorname{E}\left[\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{2(p-2)}\langle w,X\rangle^{4}\right]^{1/2}\right\}\left(\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert<\beta\right)+\operatorname{P}\left(\lVert X\rVert_{H_{p}^{-1}}>T\right)\right)^{1/2}\\
&=\left(\sup_{\lVert w\rVert_{H_{p}}=1}\lVert w\rVert_{L^{4},p}^{2}\right)\left(\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert<\beta\right)+\operatorname{P}\left(\lVert X\rVert_{H_{p}^{-1}}>T\right)\right)^{1/2}\\
&=\sigma_{p}\left(\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert<\beta\right)+\operatorname{P}\left(\lVert X\rVert_{H_{p}^{-1}}>T\right)\right)^{1/2},\end{aligned}\tag{17}$$

where the first inequality follows from the Cauchy-Schwarz inequality, and the subsequent equalities by the definitions of $\lVert\cdot\rVert_{L^{4},p}$ and $\sigma^{2}_{p}$. It remains to bound the tail probabilities. Recall that $\beta=T\varepsilon$, so

$$\begin{aligned}\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert<\beta\right)&=\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert<T\varepsilon\right)\\
&=\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{-1}>(T\varepsilon)^{-1}\right)\\
&=\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{2(p-2)}>(T\varepsilon)^{2(p-2)}\right)\\
&\leq\operatorname{E}[\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{2(p-2)}](T\varepsilon)^{2(2-p)}\\
&=c^{*}_{p}(T\varepsilon)^{2(2-p)},\end{aligned}$$

where we applied Markov's inequality in the fourth line, and the last line follows by the definition of $c^{*}_{p}$ in Theorem 5. Moreover, by the finiteness of the second moment of the coordinates $X^{j}$ of $X$, we have

$$\operatorname{E}\left[\lVert X\rVert_{H_{p}^{-1}}^{2}\right]=\operatorname{E}\left[X^{T}H_{p}^{-1}X\right]=\operatorname{E}\left[\operatorname{Tr}\left(H^{-1}_{p}XX^{T}\right)\right]=\operatorname{Tr}\left(H_{p}^{-1}\Sigma\right)=c_{p},$$

where $\Sigma=\operatorname{E}[XX^{T}]$, and the last equality follows by the definition of $c_{p}$ in Theorem 5. By Markov's inequality,

$$\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert<T\varepsilon\right)+\operatorname{P}\left(\lVert X\rVert_{H_{p}^{-1}}>T\right)\leq c^{*}_{p}T^{2(2-p)}\varepsilon^{2(2-p)}+\frac{c_{p}}{T^{2}}.$$

Choosing

$$T:=\left(\frac{c_{p}}{c^{*}_{p}(2-p)}\right)^{1/(6-2p)},\qquad\varepsilon^{2-p}:=\frac{1}{8\sigma_{p}^{3-p}\sqrt{c^{*}_{p}}\,c_{p}^{(2-p)/2}},$$

ensures that

$$\sigma_{p}\left(\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert<T\varepsilon\right)+\operatorname{P}\left(\lVert X\rVert_{H_{p}^{-1}}>T\right)\right)^{1/2}\leq 1/2.\tag{18}$$

Combining the inequalities (18), (17), (16), (15), and (14) yields the result. ∎

A detailed proof of Theorem 5, including the proof of Lemma 4, can be found in Appendix F.

Acknowledgments and Disclosure of Funding

We thank Nikita Zhivotovskiy, Sushant Sachdeva, and Deeksha Adil for feedback on the manuscript. MAE was partially supported by NSERC Grant [2019-06167], CIFAR AI Chairs program, and CIFAR AI Catalyst grant.

References

  • [AC11] Jean-Yves Audibert and Olivier Catoni “Robust Linear Least Squares Regression” In The Annals of Statistics, 2011 DOI: 10.1214/11-AOS918
  • [AKPS19] Deeksha Adil, Rasmus Kyng, Richard Peng and Sushant Sachdeva “Iterative Refinement for $\ell_p$-norm Regression” In Proceedings of the 2019 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019 DOI: 10.1137/1.9781611975482.86
  • [AKPS22] Deeksha Adil, Rasmus Kyng, Richard Peng and Sushant Sachdeva “Fast Algorithms for $\ell_p$-Regression”, 2022 DOI: 10.48550/arXiv.2211.03963
  • [APS19] Deeksha Adil, Richard Peng and Sushant Sachdeva “Fast, Provably Convergent IRLS Algorithm for p-norm Linear Regression” In Advances in Neural Information Processing Systems, 2019 URL: https://proceedings.neurips.cc/paper_files/paper/2019/hash/46c7cb50b373877fb2f8d5c4517bb969-Abstract.html
  • [Arc96] Miguel A. Arcones “The Bahadur-Kiefer Representation of $L_p$ Regression Estimators” In Econometric Theory, 1996 URL: https://www.jstor.org/stable/3532831
  • [AS19] Deeksha Adil and Sushant Sachdeva “Faster $p$-Norm Minimizing Flows, via Smoothed $q$-Norm Problems” In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019 DOI: 10.1137/1.9781611975994.54
  • [BCLL18] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee and Yuanzhi Li “An Homotopy Method for $\ell_p$ Regression Provably beyond Self-Concordance and in Input-Sparsity Time” In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018 DOI: 10.1145/3188745.3188776
  • [BRW92] Z. D. Bai, C. R. Rao and Y. Wu “M-Estimation of Multivariate Linear Regression Parameters Under a Convex Discrepancy Function” In Statistica Sinica, 1992 URL: https://www.jstor.org/stable/24304129
  • [CG17] Olivier Catoni and Ilaria Giulini “Dimension-Free PAC-Bayesian Bounds for Matrices, Vectors, and Linear Least Squares Regression”, 2017 DOI: 10.48550/arXiv.1712.02747
  • [HKZ12] Daniel Hsu, Sham M. Kakade and Tong Zhang “Random Design Analysis of Ridge Regression” In Proceedings of the 25th Annual Conference on Learning Theory, 2012 URL: https://proceedings.mlr.press/v23/hsu12.html
  • [HS96] Xuming He and Qi-Man Shao “A General Bahadur Representation of M-estimators and Its Application to Linear Regression with Nonstochastic Designs” In The Annals of Statistics, 1996 DOI: 10.1214/aos/1032181172
  • [JLS22] Arun Jambulapati, Yang P. Liu and Aaron Sidford “Improved Iteration Complexities for Overconstrained $p$-Norm Regression” In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, 2022 DOI: 10.1145/3519935.3519971
  • [KM15] Vladimir Koltchinskii and Shahar Mendelson “Bounding the Smallest Singular Value of a Random Matrix Without Concentration” In International Mathematics Research Notices, 2015 DOI: 10.1093/imrn/rnv096
  • [Kni98] Keith Knight “Limiting Distributions for $L_1$ Regression Estimators under General Conditions” In The Annals of Statistics, 1998 DOI: 10.1214/aos/1028144858
  • [LC83] Erich L. Lehmann and George Casella “Theory of Point Estimation”, 1983
  • [LL05] P. Y. Lai and Stephen M. S. Lee “An Overview of Asymptotic Properties of $L_p$ Regression under General Classes of Error Distributions” In Journal of the American Statistical Association, 2005 URL: https://www.jstor.org/stable/27590567
  • [LM16] Guillaume Lecué and Shahar Mendelson “Performance of Empirical Risk Minimization in Linear Aggregation” In Bernoulli, 2016 DOI: 10.3150/15-BEJ701
  • [LM17] Guillaume Lecué and Shahar Mendelson “Regularization and the Small-Ball Method II: Complexity Dependent Error Rates” In Journal of Machine Learning Research, 2017 URL: http://jmlr.org/papers/v18/16-422.html
  • [LM17a] Guillaume Lecué and Shahar Mendelson “Sparse Recovery under Weak Moment Assumptions” In Journal of the European Mathematical Society, 2017 DOI: 10.4171/jems/682
  • [LM18] Guillaume Lecué and Shahar Mendelson “Regularization and the Small-Ball Method I: Sparse Recovery” In The Annals of Statistics, 2018 DOI: 10.1214/17-AOS1562
  • [Men14] Shahar Mendelson “Learning without Concentration” In Proceedings of The 27th Conference on Learning Theory, 2014 URL: https://proceedings.mlr.press/v35/mendelson14.html
  • [Men18] Shahar Mendelson “Learning without Concentration for General Loss Functions” In Probability Theory and Related Fields, 2018 DOI: 10.1007/s00440-017-0784-y
  • [Men21] Shahar Mendelson “Learning Bounded Subsets of $L_p$” In IEEE Transactions on Information Theory, 2021 DOI: 10.1109/TIT.2021.3083553
  • [MG22] Jaouad Mourtada and Stéphane Gaïffas “An Improper Estimator with Optimal Excess Risk in Misspecified Density Estimation and Logistic Regression” In Journal of Machine Learning Research, 2022 URL: http://jmlr.org/papers/v23/20-782.html
  • [Mou22] Jaouad Mourtada “Exact Minimax Risk for Linear Least Squares, and the Lower Tail of Sample Covariance Matrices” In The Annals of Statistics, 2022 DOI: 10.1214/22-AOS2181
  • [MVZ22] Jaouad Mourtada, Tomas Vaškevičius and Nikita Zhivotovskiy “Distribution-free robust linear regression” In Mathematical Statistics and Learning, 2022 DOI: 10.4171/MSL/27
  • [Nie92] Wojciech Niemiro “Asymptotics for $M$-Estimators Defined by Convex Minimization” In The Annals of Statistics, 1992 DOI: 10.1214/aos/1176348782
  • [OB21] Dmitrii M. Ostrovskii and Francis Bach “Finite-Sample Analysis of $M$-Estimators Using Self-Concordance” In Electronic Journal of Statistics, 2021 DOI: 10.1214/20-EJS1780
  • [Oli16] Roberto Imbuzeiro Oliveira “The Lower Tail of Random Quadratic Forms with Applications to Ordinary Least Squares” In Probability Theory and Related Fields, 2016 DOI: 10.1007/s00440-016-0738-9
  • [Ron84] Arjen E. Ronner “Asymptotic Normality of $p$-norm Estimators in Multiple Regression” In Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1984 DOI: 10.1007/BF00531893
  • [RV15] Mark Rudelson and Roman Vershynin “Small Ball Probabilities for Linear Images of High-Dimensional Distributions” In International Mathematics Research Notices, 2015 DOI: 10.1093/imrn/rnu243
  • [Sau18] Adrien Saumard “On Optimality of Empirical Risk Minimization in Linear Aggregation” In Bernoulli, 2018 DOI: 10.3150/17-BEJ925
  • [Vaa98] A. W. van der Vaart “Asymptotic Statistics”, 1998 DOI: 10.1017/CBO9780511802256
  • [VW96] Aad W. Van Der Vaart and Jon A. Wellner “Weak Convergence and Empirical Processes”, 1996 DOI: 10.1007/978-1-4757-2545-2
  • [VZ23] Tomas Vaškevičius and Nikita Zhivotovskiy “Suboptimality of Constrained Least Squares and Improvements via Non-Linear Predictors” In Bernoulli, 2023 DOI: 10.3150/22-BEJ1465
  • [Whi82] Halbert White “Maximum Likelihood Estimation of Misspecified Models” In Econometrica, 1982 DOI: 10.2307/1912526

Appendix A Differentiability of the risk

In this section, we rigorously establish the twice differentiability of the risk under our assumptions. We start by showing that, under a subset of our assumptions, the risk is differentiable everywhere on $\mathbb{R}^{d}$.

Lemma 5.

Let $p\in(1,\infty)$ and assume that $\operatorname{E}[\lvert Y\rvert^{p}]<\infty$ and $\operatorname{E}[\lvert X^{j}\rvert^{p}]<\infty$ for all $j\in[d]$. Then $R_{p}$ is differentiable on $\mathbb{R}^{d}$, and

$$\nabla R_{p}(w)=\operatorname{E}[\nabla\ell_{p}(\langle w,X\rangle-Y)].$$
Proof.

Let $w\in\mathbb{R}^{d}$. We want to show that

$$\lim_{\Delta\to 0}\frac{\left\lvert R_{p}(w+\Delta)-R_{p}(w)-\operatorname{E}\left[\langle\nabla\ell_{p}(\langle w,X\rangle-Y),\Delta\rangle\right]\right\rvert}{\lVert\Delta\rVert}=0,$$

where, for convenience, we take the norm $\lVert\cdot\rVert$ to be the Euclidean norm. Define the function $\phi(w,X,Y):=\ell_{p}(\langle w,X\rangle-Y)$ and note that by the chain rule, $\phi$ is differentiable as a function of $w$ on all of $\mathbb{R}^{d}$. Now let $(\Delta_{k})_{k\in\mathbb{N}}$ be a sequence in $\mathbb{R}^{d}$ such that $\lim_{k\to\infty}\lVert\Delta_{k}\rVert=0$. Then

$$\begin{aligned}&\lim_{k\to\infty}\frac{\left\lvert R_{p}(w+\Delta_{k})-R_{p}(w)-\operatorname{E}\left[\langle\nabla\phi(w,X,Y),\Delta_{k}\rangle\right]\right\rvert}{\lVert\Delta_{k}\rVert}\\
&=\lim_{k\to\infty}\frac{\left\lvert\operatorname{E}\left[\phi(w+\Delta_{k},X,Y)-\phi(w,X,Y)-\langle\nabla\phi(w,X,Y),\Delta_{k}\rangle\right]\right\rvert}{\lVert\Delta_{k}\rVert}\\
&\leq\lim_{k\to\infty}\operatorname{E}\left[\frac{\left\lvert\phi(w+\Delta_{k},X,Y)-\phi(w,X,Y)-\langle\nabla\phi(w,X,Y),\Delta_{k}\rangle\right\rvert}{\lVert\Delta_{k}\rVert}\right].\end{aligned}\tag{19}$$

Our goal is to interchange the limit and expectation. For that, we will use the dominated convergence theorem. We construct our dominating function as follows. Let $R:=\sup_{k\in\mathbb{N}}\lVert\Delta_{k}\rVert$, and note that $R<\infty$ since $\lVert\Delta_{k}\rVert\to 0$ as $k\to\infty$. Then we have

$$\begin{aligned}&\frac{\left\lvert\phi(w+\Delta_{k},X,Y)-\phi(w,X,Y)-\langle\nabla\phi(w,X,Y),\Delta_{k}\rangle\right\rvert}{\lVert\Delta_{k}\rVert}\\
&\leq\frac{\left\lvert\phi(w+\Delta_{k},X,Y)-\phi(w,X,Y)\right\rvert}{\lVert\Delta_{k}\rVert}+\frac{\left\lvert\langle\nabla\phi(w,X,Y),\Delta_{k}\rangle\right\rvert}{\lVert\Delta_{k}\rVert}\\
&\leq\frac{\left\lvert\left\langle\int_{0}^{1}\nabla\phi(w+t\Delta_{k},X,Y)\,dt,\Delta_{k}\right\rangle\right\rvert}{\lVert\Delta_{k}\rVert}+\lVert\nabla\phi(w,X,Y)\rVert\\
&\leq\left\lVert\int_{0}^{1}\nabla\phi(w+t\Delta_{k},X,Y)\,dt\right\rVert+\lVert\nabla\phi(w,X,Y)\rVert\\
&\leq\int_{0}^{1}\lVert\nabla\phi(w+t\Delta_{k},X,Y)\rVert\,dt+\lVert\nabla\phi(w,X,Y)\rVert\\
&\leq 2\sup_{\Delta\in B(0,R)}\lVert\nabla\phi(w+\Delta,X,Y)\rVert\\
&\leq\frac{2}{p-1}\lVert X\rVert\sup_{\Delta\in B(0,R)}\left\lvert\langle w+\Delta,X\rangle-Y\right\rvert^{p-1}\\
&\leq\frac{2}{p-1}\lVert X\rVert\sup_{\Delta\in B(0,R)}\max\{2^{p-1},1\}\left(\left\lvert\langle w,X\rangle-Y\right\rvert^{p-1}+\left\lvert\langle\Delta,X\rangle\right\rvert^{p-1}\right)\\
&\leq\frac{2^{p}}{p-1}\left\{\left\lvert\langle w,X\rangle-Y\right\rvert^{p-1}\lVert X\rVert+R^{p-1}\lVert X\rVert^{p}\right\}=:g(X,Y),\end{aligned}$$

where the second line follows by the triangle inequality, the third from the fundamental theorem of calculus applied component-wise, the fourth by the Cauchy-Schwarz inequality, the fifth by Jensen's inequality and the convexity of the norm, and the eighth by the inequality $\lvert a+b\rvert^{q}\leq\max\{2^{q-1},1\}(\lvert a\rvert^{q}+\lvert b\rvert^{q})$, valid for $q>0$. It remains to show that $g(X,Y)$ is integrable. We have

$$\begin{aligned}\operatorname{E}\left[g(X,Y)\right]&=\frac{2^{p}}{p-1}\operatorname{E}\left[\left\lvert\langle w,X\rangle-Y\right\rvert^{p-1}\lVert X\rVert+R^{p-1}\lVert X\rVert^{p}\right]\\
&\leq\frac{2^{p}}{p-1}\left\{\sum_{j=1}^{d}\operatorname{E}\left[\left\lvert\langle w,X\rangle-Y\right\rvert^{p-1}\lvert X^{j}\rvert\right]+R^{p-1}\operatorname{E}\left[\left(\sum_{j=1}^{d}\lvert X^{j}\rvert\right)^{p}\right]\right\}\\
&\leq\frac{2^{p}}{p-1}\left\{\sum_{j=1}^{d}\operatorname{E}\left[\lvert\langle w,X\rangle-Y\rvert^{p}\right]^{\frac{p-1}{p}}\operatorname{E}\left[\lvert X^{j}\rvert^{p}\right]^{1/p}+R^{p-1}d^{p}\sum_{j=1}^{d}\operatorname{E}\left[\lvert X^{j}\rvert^{p}\right]\right\}\\
&<\infty,\end{aligned}$$

where in the second line we used that the Euclidean norm is bounded by the $1$-norm, in the third we used Hölder's inequality, and the last line follows from our assumptions. Applying the dominated convergence theorem, we interchange the limit and the expectation in (19). Recalling that $\phi$ is differentiable finishes the proof. ∎

We now turn to the twice differentiability of the risk. We start with the easy case $p\in[2,\infty)$. The proof is very similar to that of Lemma 5, and we omit it here.

Lemma 6.

Let $p\in[2,\infty)$ and assume that $\operatorname{E}[\lvert Y\rvert^{p}]<\infty$ and $\operatorname{E}[\lvert X^{j}\rvert^{p}]<\infty$ for all $j\in[d]$. Then $R_{p}$ is twice differentiable on $\mathbb{R}^{d}$, and

$$\nabla^{2}R_{p}(w)=\operatorname{E}[\nabla^{2}\ell_{p}(\langle w,X\rangle-Y)].$$

The case $p\in(1,2)$ is more complicated. The following lemma establishes the twice differentiability of the risk at its minimizer under a subset of the assumptions of Theorem 5.

Lemma 7.

Let $p\in(1,2)$. Assume that $\operatorname{P}\left(\lvert\langle w_{p}^{*},X\rangle-Y\rvert=0\right)=0$ and $\operatorname{E}[\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{p-2}(X^{j})^{2}]<\infty$ for all $j\in[d]$. Then $R_{p}$ is twice differentiable at $w_{p}^{*}$, and

\nabla^{2}R_{p}(w_{p}^{*})=\operatorname{E}[\nabla^{2}\ell_{p}(\langle w_{p}^{*},X\rangle-Y)].
Proof.

The difficulty in the proof compared to Lemma 5 and Lemma 6 stems from the fact that the loss is not twice differentiable at zero. We still rely on the dominated convergence theorem, but the construction of the dominating function is slightly more intricate. Using the setup of the proof of Lemma 5, and following the same line of arguments, we arrive at

limkRp(wp+Δk)Rp(wp)E[2ϕ(wp,X,Y)Δk]Δk\displaystyle\lim_{k\to\infty}\frac{\lVert\nabla R_{p}(w_{p}^{*}+\Delta_{k})-\nabla R_{p}(w_{p}^{*})-\operatorname{E}\left[\nabla^{2}\phi(w_{p}^{*},X,Y)\Delta_{k}\right]\rVert}{\lVert\Delta_{k}\rVert}
limkE[ϕ(wp+Δk,X,Y)ϕ(wp,X,Y)2ϕ(wp,X,Y)ΔkΔk],\displaystyle\leq\lim_{k\to\infty}\operatorname{E}\left[\frac{\lVert\nabla\phi(w_{p}^{*}+\Delta_{k},X,Y)-\nabla\phi(w_{p}^{*},X,Y)-\nabla^{2}\phi(w_{p}^{*},X,Y)\Delta_{k}\rVert}{\lVert\Delta_{k}\rVert}\right], (20)

where we have used the fact that, since $\operatorname{P}(\lvert\langle w_{p}^{*},X\rangle-Y\rvert=0)=0$, the map $w\mapsto\phi(w,X,Y)$ is almost surely twice differentiable at $w_{p}^{*}$. To finish the proof, it remains to construct a dominating function for the above sequence to justify the interchange of the limit and expectation. We consider two cases.

Case 1: Δk|wp,XY|/(2X)=:R(X,Y)\lVert\Delta_{k}\rVert\geq\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert/(2\lVert X\rVert)=\vcentcolon R(X,Y). Then we have

ϕ(wp+Δk,X,Y)ϕ(wp,X,Y)2ϕ(wp,X,Y)ΔkΔk\displaystyle\frac{\lVert\nabla\phi(w_{p}^{*}+\Delta_{k},X,Y)-\nabla\phi(w_{p}^{*},X,Y)-\nabla^{2}\phi(w_{p}^{*},X,Y)\Delta_{k}\rVert}{\lVert\Delta_{k}\rVert}
ϕ(wp+Δk,X,Y)+ϕ(wp,X,Y)+2ϕ(wp,X,Y)ΔkΔk\displaystyle\leq\frac{\lVert\nabla\phi(w_{p}^{*}+\Delta_{k},X,Y)\rVert+\lVert\nabla\phi(w_{p}^{*},X,Y)\rVert+\lVert\nabla^{2}\phi(w_{p}^{*},X,Y)\Delta_{k}\rVert}{\lVert\Delta_{k}\rVert}
\displaystyle\leq\frac{\left\lparen\left\lvert\langle w_{p}^{*}+\Delta_{k},X\rangle-Y\right\rvert^{p-1}+\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert^{p-1}\right\rparen\lVert X\rVert}{(p-1)\lVert\Delta_{k}\rVert}+\lVert\nabla^{2}\phi(w_{p}^{*},X,Y)\rVert_{op}
2|wp,XY|p1X(p1)Δk+|wp,XY|p2X2+|Δk/Δk,X|p1X(p1)Δk2p\displaystyle\leq\frac{2\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert^{p-1}\lVert X\rVert}{(p-1)\lVert\Delta_{k}\rVert}+\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert^{p-2}\lVert X\rVert^{2}+\frac{\lvert\langle\Delta_{k}/\lVert\Delta_{k}\rVert,X\rangle\rvert^{p-1}\lVert X\rVert}{(p-1)\lVert\Delta_{k}\rVert^{2-p}}
4|wp,XY|p2X2(p1)+|wp,XY|p2X2+Xp(p1)Δk2p\displaystyle\leq\frac{4\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert^{p-2}\lVert X\rVert^{2}}{(p-1)}+\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert^{p-2}\lVert X\rVert^{2}+\frac{\lVert X\rVert^{p}}{(p-1)\lVert\Delta_{k}\rVert^{2-p}}
7|wp,XY|p2X2(p1)\displaystyle\leq\frac{7\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert^{p-2}\lVert X\rVert^{2}}{(p-1)}

where the second line follows by the triangle inequality, the third by definition of the operator norm, the fourth by the inequality $\lvert a+b\rvert^{q}\leq\lvert a\rvert^{q}+\lvert b\rvert^{q}$, valid for $q\in(0,1)$, and the fifth and sixth by the Cauchy-Schwarz inequality and the assumed lower bound on $\lVert\Delta_{k}\rVert$.

Case 2: Δk<R(X,Y)\lVert\Delta_{k}\rVert<R(X,Y). We start by noting that, for all ΔB(0,R(X,Y)):={xdx<R(X,Y)}\Delta\in B(0,R(X,Y))\vcentcolon=\left\{x\in\mathbb{R}^{d}\mid\lVert x\rVert<R(X,Y)\right\}, we have

|wp+Δ,XY||wp,XY||Δ,X||wp,XY|ΔX>|wp,XY|/2>0.\left\lvert\langle w_{p}^{*}+\Delta,X\rangle-Y\right\rvert\geq\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert-\left\lvert\langle\Delta,X\rangle\right\rvert\geq\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert-\lVert\Delta\rVert\lVert X\rVert>\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert/2>0.

Therefore $w\mapsto\phi(w,X,Y)$ is twice differentiable on $w_{p}^{*}+B(0,R(X,Y))$. Now

ϕ(wp+Δk,X,Y)ϕ(wp,X,Y)2ϕ(wp,X,Y)ΔkΔk\displaystyle\frac{\lVert\nabla\phi(w_{p}^{*}+\Delta_{k},X,Y)-\nabla\phi(w_{p}^{*},X,Y)-\nabla^{2}\phi(w_{p}^{*},X,Y)\Delta_{k}\rVert}{\lVert\Delta_{k}\rVert}
ϕ(wp+Δk,X,Y)ϕ(wp,X,Y)+2ϕ(wp,X,Y)ΔkΔk\displaystyle\leq\frac{\lVert\nabla\phi(w_{p}^{*}+\Delta_{k},X,Y)-\nabla\phi(w_{p}^{*},X,Y)\rVert+\lVert\nabla^{2}\phi(w_{p}^{*},X,Y)\Delta_{k}\rVert}{\lVert\Delta_{k}\rVert}
(012ϕ(wp+tΔk,X,Y)𝑑t)ΔkΔk+2ϕ(wp,X,Y)op\displaystyle\leq\frac{\left\lVert\left\lparen\int_{0}^{1}\nabla^{2}\phi(w_{p}^{*}+t\Delta_{k},X,Y)dt\right\rparen\Delta_{k}\right\rVert}{\lVert\Delta_{k}\rVert}+\lVert\nabla^{2}\phi(w_{p}^{*},X,Y)\rVert_{op}
\displaystyle\leq\left\lVert\int_{0}^{1}\nabla^{2}\phi(w_{p}^{*}+t\Delta_{k},X,Y)dt\right\rVert_{op}+\lVert\nabla^{2}\phi(w_{p}^{*},X,Y)\rVert_{op}
\displaystyle\leq\int_{0}^{1}\lVert\nabla^{2}\phi(w_{p}^{*}+t\Delta_{k},X,Y)\rVert_{op}dt+\lVert\nabla^{2}\phi(w_{p}^{*},X,Y)\rVert_{op}
2supΔB(0,R(X,Y))2ϕ(wp+Δ,X,Y)op\displaystyle\leq 2\sup_{\Delta\in B(0,R(X,Y))}\lVert\nabla^{2}\phi(w_{p}^{*}+\Delta,X,Y)\rVert_{op}
2X22supΔB(0,R(X,Y))|wp+Δ,XY|p2\displaystyle\leq 2\lVert X\rVert_{2}^{2}\sup_{\Delta\in B(0,R(X,Y))}\left\lvert\langle w_{p}^{*}+\Delta,X\rangle-Y\right\rvert^{p-2}
4|wp,XY|p2X22\displaystyle\leq 4\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert^{p-2}\lVert X\rVert_{2}^{2}

where the second line follows from the triangle inequality, the third from the twice differentiability of $\phi(\cdot,X,Y)$ on $w_{p}^{*}+B(0,R(X,Y))$ and the fundamental theorem of calculus applied component-wise, the fifth by Jensen's inequality, and the last by definition of $R(X,Y)$ and the above lower bound. We therefore define our dominating function by

g(X,Y)\vcentcolon=\frac{8}{p-1}\left\lvert\langle w_{p}^{*},X\rangle-Y\right\rvert^{p-2}\lVert X\rVert_{2}^{2}.

It is then immediate from our assumptions that $g(X,Y)$ is integrable. Indeed, the bound of Case 1 is at most $\frac{7}{p-1}\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{p-2}\lVert X\rVert^{2}\leq g(X,Y)$, and that of Case 2 is at most $4\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{p-2}\lVert X\rVert^{2}\leq g(X,Y)$ since $1/(p-1)>1$ for $p\in(1,2)$. Interchanging the limit and the expectation in (20) and recalling that $\phi(\cdot,X,Y)$ is almost surely twice differentiable at $w_{p}^{*}$ finishes the proof. ∎

Appendix B Proof of Lemma 1

We start with the claim that ρ0\rho_{0} is upper-semicontinuous. We want to show that for any wdw\in\mathbb{R}^{d} and any sequence (wk)k=1(w_{k})_{k=1}^{\infty} converging to ww (in the norm topology)

lim supkρ0(wk)ρ0(w).\limsup_{k\to\infty}\rho_{0}(w_{k})\leq\rho_{0}(w).

Fix $w\in\mathbb{R}^{d}$ and let $(w_{k})_{k=1}^{\infty}$ be a sequence in $\mathbb{R}^{d}$ satisfying $\lim_{k\to\infty}\lVert w-w_{k}\rVert=0$, where for convenience we take $\lVert\cdot\rVert$ to be the Euclidean norm on $\mathbb{R}^{d}$. Then we have, by the reverse Fatou lemma,

lim supkρ0(wk)=lim supkE[𝟙{0}(wk,X)]E[lim supk𝟙{0}(wk,X)].\limsup_{k\to\infty}\rho_{0}(w_{k})=\limsup_{k\to\infty}\operatorname{E}\left[\mathbbm{1}_{\{0\}}(\langle w_{k},X\rangle)\right]\\ \leq\operatorname{E}\left[\limsup_{k\to\infty}\mathbbm{1}_{\{0\}}(\langle w_{k},X\rangle)\right]. (21)

Now we bound the inner limsup pointwise. We split this task into two cases. If $\langle w,X\rangle=0$, then

lim supk𝟙{0}(wk,X)1=𝟙{0}(w,X).\limsup_{k\to\infty}\mathbbm{1}_{\{0\}}(\langle w_{k},X\rangle)\leq 1=\mathbbm{1}_{\{0\}}(\langle w,X\rangle). (22)

Otherwise we have δ:=|w,X|>0\delta\vcentcolon=\lvert\langle w,X\rangle\rvert>0. But then, by the convergence of (wk)k=1(w_{k})_{k=1}^{\infty} to ww, there exists a KK\in\mathbb{N} such that for all kKk\geq K we have wkw<δ/(2X)\lVert w_{k}-w\rVert<\delta/(2\lVert X\rVert). This implies that for all kKk\geq K

|wk,X|=|w,Xwwk,X||w,X||wwk,X|δwkw2Xδ/2>0.\lvert\langle w_{k},X\rangle\rvert=\lvert\langle w,X\rangle-\langle w-w_{k},X\rangle\rvert\geq\lvert\langle w,X\rangle\rvert-\lvert\langle w-w_{k},X\rangle\rvert\geq\delta-\lVert w_{k}-w\rVert_{2}\lVert X\rVert\geq\delta/2>0.

We conclude that

lim supk𝟙{0}(wk,X)=limk𝟙{0}(wk,X)=0=𝟙{0}(w,X).\limsup_{k\to\infty}\mathbbm{1}_{\{0\}}(\langle w_{k},X\rangle)=\lim_{k\to\infty}\mathbbm{1}_{\{0\}}(\langle w_{k},X\rangle)=0=\mathbbm{1}_{\{0\}}(\langle w,X\rangle). (23)

Combining (21), (22), and (23) proves the upper-semicontinuity of ρ0\rho_{0}. Essentially the same proof shows the lower-semicontinuity of ρq(,κ)\rho_{q}(\cdot,\kappa) for any κ0\kappa\geq 0; we omit it here.

For the remaining claims, first notice that the function $\rho_{0}$ is scale invariant: for all $w\in\mathbb{R}^{d}$ and all $c\in\mathbb{R}\setminus\{0\}$, we have $\rho_{0}(cw)=\rho_{0}(w)$. Therefore

supwd{0}ρ0(w)=supw𝕊d1ρ0(w),\sup_{w\in\mathbb{R}^{d}\setminus\{0\}}\rho_{0}(w)=\sup_{w\in\mathbb{S}^{d-1}}\rho_{0}(w),

where 𝕊d1\mathbb{S}^{d-1} is the Euclidean unit sphere. By assumption on the random vector XX, we know that ρ0(w)<1\rho_{0}(w)<1 for all w𝕊d1w\in\mathbb{S}^{d-1}. Furthermore since ρ0\rho_{0} is upper semicontinuous, and 𝕊d1\mathbb{S}^{d-1} is compact, ρ0\rho_{0} attains its supremum on 𝕊d1\mathbb{S}^{d-1} at some point w0𝕊d1w_{0}\in\mathbb{S}^{d-1}. From this we conclude that

ρ=supwd{0}ρ0(w)=ρ0(w0)<1.\rho=\sup_{w\in\mathbb{R}^{d}\setminus\{0\}}\rho_{0}(w)=\rho_{0}(w_{0})<1.

Finally, we turn to the claim about ρq\rho_{q}. Since E[|Xj|q]<\operatorname{E}[\lvert X^{j}\rvert^{q}]<\infty, the function Lq\lVert\cdot\rVert_{L^{q}} is a norm on d\mathbb{R}^{d}, from which it follows that ρq(w,κ)\rho_{q}(w,\kappa) is also scale invariant for any κ\kappa. Therefore

infwd{0}ρq(w,κ)=infwSqρq(w,κ),\inf_{w\in\mathbb{R}^{d}\setminus\{0\}}\rho_{q}(w,\kappa)=\inf_{w\in S_{q}}\rho_{q}(w,\kappa),

where SqS_{q} is the unit sphere of the norm Lq\lVert\cdot\rVert_{L^{q}}. Now fix κ[0,1)\kappa\in[0,1). We claim that ρq(w,κ)>0\rho_{q}(w,\kappa)>0 for all wSqw\in S_{q}. Suppose not. Then there exists a wSqw\in S_{q} such that |w,X|κ\lvert\langle w,X\rangle\rvert\leq\kappa with probability 11, but then we get the contradiction

1=wLq=E[|w,X|q]1/qκ<1.1=\lVert w\rVert_{L^{q}}=\operatorname{E}[\lvert\langle w,X\rangle\rvert^{q}]^{1/q}\leq\kappa<1.

Therefore $\rho_{q}(w,\kappa)>0$ for all $w\in S_{q}$. Now since $\rho_{q}(\cdot,\kappa)$ is lower-semicontinuous, and $S_{q}$ is compact, $\rho_{q}(\cdot,\kappa)$ attains its infimum on $S_{q}$ at some point $w_{q}\in S_{q}$. From this we conclude

infwd{0}ρq(w,κ)=ρq(wq,κ)>0.\inf_{w\in\mathbb{R}^{d}\setminus\{0\}}\rho_{q}(w,\kappa)=\rho_{q}(w_{q},\kappa)>0.
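To make the quantities $\rho_{0}$ and $\rho$ concrete, here is a small illustrative sketch (a toy example of ours, not taken from the paper): if $X$ is uniform on the standard basis vectors $e_{1},\dotsc,e_{d}$, then $\rho_{0}(w)=\operatorname{P}(\langle w,X\rangle=0)$ equals the fraction of zero coordinates of $w$, so $\rho=(d-1)/d<1$.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5

def rho0(w, n_mc=200_000):
    # Monte Carlo estimate of rho0(w) = P(<w, X> = 0) when X is uniform
    # on the standard basis e_1, ..., e_d (an illustrative choice of P_X)
    J = rng.integers(0, d, size=n_mc)
    return np.mean(w[J] == 0.0)

w = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
print(rho0(w))  # ~ 3/5, the fraction of zero coordinates of w
# rho = sup_{w != 0} rho0(w) = (d - 1)/d < 1, attained at any w = e_j
```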

Appendix C Proof of Theorem 3

Fix $p\in(1,\infty)$, and let $\hat{w}\vcentcolon=\hat{w}_{p}$. Our goal will be to upper bound the probability $\operatorname{P}(\hat{w}\neq w^{*})$. By assumption, we have that $Y=\langle w^{*},X\rangle$, so that $Y_{i}=\langle w^{*},X_{i}\rangle$ for all $i\in[n]$. In particular, $w^{*}$ achieves zero empirical risk, so since $\hat{w}$ minimizes the non-negative empirical risk, we must also have $\langle\hat{w},X_{i}\rangle=Y_{i}=\langle w^{*},X_{i}\rangle$ for all $i\in[n]$. Let $A\in\mathbb{R}^{n\times d}$ denote the matrix whose $i$-th row is $X_{i}$. Then we have the following implications.

w^ww^w,Xi=0i[n]wd{0}Aw=0rowrank(A)<d.\hat{w}\neq w^{*}\Rightarrow\langle\hat{w}-w^{*},X_{i}\rangle=0\,\,\forall i\in[n]\Rightarrow\exists w\in\mathbb{R}^{d}\setminus\{0\}\mid Aw=0\Leftrightarrow\operatorname{rowrank}(A)<d. (24)

Let r:=rowrank(A)r\vcentcolon=\operatorname{rowrank}(A). We claim the following equivalence

rowrank(A)<dS[n]|S|=d1i[n]SXispan({XkkS}).\operatorname{rowrank}(A)<d\Leftrightarrow\exists S\subset[n]\mid\lvert S\rvert=d-1\land\forall i\in[n]\setminus S\,\,X_{i}\in\operatorname{span}\left\lparen\left\{X_{k}\mid k\in S\right\}\right\rparen. (25)

Indeed the implication ()(\Leftarrow) follows by definition of the rowrank of AA. For the implication ()(\Rightarrow), by definition, {Xii[n]}\left\{X_{i}\mid i\in[n]\right\} is a spanning set for the row space of AA, therefore it can be reduced to a basis of it {XkkS1}\left\{X_{k}\mid k\in S_{1}\right\} for some indices S1[n]S_{1}\subset[n] with |S1|=r\lvert S_{1}\rvert=r. If r=d1r=d-1, then the choice S=S1S=S_{1} satisfies the right side of (25). Otherwise, let S2[n]S1S_{2}\subset[n]\setminus S_{1} with |S2|=d1r\lvert S_{2}\rvert=d-1-r. Such a subset exists since by assumption nd>d1n\geq d>d-1. Then the set S:=S1S2S\vcentcolon=S_{1}\cup S_{2} satisfies the right side of (25). Combining (24) and (25) we arrive at:

P(w^w)\displaystyle\operatorname{P}\left\lparen\hat{w}\neq w^{*}\right\rparen P(S[n]|S|=d1{i[n]SXispan({XkkS})})\displaystyle\leq\operatorname{P}\left\lparen\bigcup_{\begin{subarray}{c}S\subset[n]\\ \lvert S\rvert=d-1\end{subarray}}\left\{\forall i\in[n]\setminus S\,\,X_{i}\in\operatorname{span}\left\lparen\left\{X_{k}\mid k\in S\right\}\right\rparen\right\}\right\rparen
S[n]|S|=d1P(i[n]SXispan({XkkS}))\displaystyle\leq\sum_{\begin{subarray}{c}S\subset[n]\\ \lvert S\rvert=d-1\end{subarray}}\operatorname{P}\left\lparen\forall i\in[n]\setminus S\,\,X_{i}\in\operatorname{span}\left\lparen\left\{X_{k}\mid k\in S\right\}\right\rparen\right\rparen (26)

where the second inequality follows from the union bound. We now bound each of the terms of the sum. Fix $S=\{i_{1},\dotsc,i_{d-1}\}\subset[n]$ with $\lvert S\rvert=d-1$. Let $Z_{S}=n((X_{i_{j}})_{j=1}^{d-1})$ be a non-zero vector orthogonal to $\operatorname{span}(\{X_{k}\mid k\in S\})$. Such a vector must exist since $\dim(\operatorname{span}(\{X_{k}\mid k\in S\}))<d$; see Lemma 8 below for an explicit construction of the function $n$. Denote by $P_{Z_{S}}$ the distribution of $Z_{S}$ and $P_{(X_{i})_{i\in[n]\setminus S}}=\prod_{i=1}^{n-d+1}P_{X}$ the distribution of $(X_{i})_{i\in[n]\setminus S}$, where $P_{X}$ is the distribution of $X$. Note that since $Z_{S}$ is a function of $(X_{i_{j}})_{j=1}^{d-1}$ only, it is independent of $(X_{i})_{i\in[n]\setminus S}$. In particular, the joint distribution of $(Z_{S},(X_{i})_{i\in[n]\setminus S})$ is given by the product $P_{Z_{S}}\times P_{(X_{i})_{i\in[n]\setminus S}}$. Now if $X_{i}\in\operatorname{span}(\{X_{k}\mid k\in S\})$, then by definition of $Z_{S}$, $\langle Z_{S},X_{i}\rangle=0$. Therefore

\displaystyle\operatorname{P}\left\lparen\forall i\in[n]\setminus S\,\,X_{i}\in\operatorname{span}\left\lparen\left\{X_{k}\mid k\in S\right\}\right\rparen\right\rparen\leq\operatorname{P}\left\lparen\forall i\in[n]\setminus S\,\,\langle Z_{S},X_{i}\rangle=0\right\rparen
\displaystyle=\operatorname{E}\left[\prod_{i\in[n]\setminus S}\mathbbm{1}_{\left\{0\right\}}(\langle Z_{S},X_{i}\rangle)\right]
\displaystyle=\int\left\{\prod_{i\in[n]\setminus S}\mathbbm{1}_{\left\{0\right\}}(\langle z_{S},x_{i}\rangle)\right\}P_{Z_{S}}(dz_{S})P_{(X_{i})_{i\in[n]\setminus S}}(d(x_{i})_{i\in[n]\setminus S})
\displaystyle=\int P_{Z_{S}}(dz_{S})\left\{\prod_{i\in[n]\setminus S}\int\mathbbm{1}_{\left\{0\right\}}(\langle z_{S},x_{i}\rangle)P_{X}(dx_{i})\right\}
\displaystyle=\int\left\{\prod_{i\in[n]\setminus S}\operatorname{P}\left\lparen\langle z_{S},X\rangle=0\right\rparen\right\}P_{Z_{S}}(dz_{S})
\displaystyle=\int\left\{\prod_{i\in[n]\setminus S}\rho_{0}(z_{S})\right\}P_{Z_{S}}(dz_{S})
\displaystyle\leq\rho^{n-d+1}, (27)

where in the third line we used the independence of $Z_{S}$ and $(X_{i})_{i\in[n]\setminus S}$, in the fourth we used the independence of the $(X_{i})_{i\in[n]\setminus S}$, in the sixth we used the definition of $\rho_{0}$ in (5), and in the last line we used the fact that $Z_{S}\neq 0$ and the definition of $\rho$ in Lemma 1. Combining the inequalities (26) and (27) yields the result. ∎
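The mechanism behind this bound can be observed numerically. The sketch below (our own illustration; the atomic distribution for $X$ is an arbitrary choice with $\rho<1$) estimates the probability that the design matrix is rank deficient, which by (24) upper bounds $\operatorname{P}(\hat{w}\neq w^{*})$:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
atoms = rng.standard_normal((6, d))  # 6 atoms in general position in R^3

def prob_rank_deficient(n, n_mc=5_000):
    # P(rank(A) < d) when the n rows of A are drawn uniformly from the atoms
    cnt = 0
    for _ in range(n_mc):
        A = atoms[rng.integers(0, len(atoms), size=n)]
        cnt += np.linalg.matrix_rank(A) < d
    return cnt / n_mc

for n in (3, 5, 8, 12):
    # decays geometrically in n, consistent with the rho^{n-d+1} factor
    print(n, prob_rank_deficient(n))
```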

Lemma 8.

Let m{1,,d1}m\in\left\{1,\dotsc,d-1\right\} and let (xj)j=1m(x_{j})_{j=1}^{m} be a sequence of points in d\mathbb{R}^{d}. Denote by Am×dA\in\mathbb{R}^{m\times d} the matrix whose jj-th row is xjx_{j} and let A+A^{+} be its pseudo-inverse. Let (bi)i=1d(b_{i})_{i=1}^{d} be an ordered basis of d\mathbb{R}^{d}, and define

\displaystyle k\vcentcolon=\min\left\{i\in[d]\mid(I-A^{+}A)b_{i}\neq 0\right\}
n((xj)j=1m):=(IA+A)bk\displaystyle n((x_{j})_{j=1}^{m})\vcentcolon=(I-A^{+}A)b_{k}

Then n((xj)j=1m)n((x_{j})_{j=1}^{m}) is non-zero and is orthogonal to span({xjj[m]})\operatorname{span}\left\lparen\left\{x_{j}\mid j\in[m]\right\}\right\rparen.

Proof.

We start by showing that $k$ is well defined. First note that $I-A^{+}A$ is the orthogonal projector onto the kernel of $A$, which is non-trivial since $\dim(\ker(A))=d-\dim(\operatorname{Im}(A))\geq d-m\geq 1$. Now we claim that there exists an $i\in[d]$ such that $(I-A^{+}A)b_{i}\neq 0$. Suppose not; then for any $w\in\mathbb{R}^{d}$, writing $w=\sum_{i=1}^{d}c_{i}b_{i}$, we have $(I-A^{+}A)w=\sum_{i=1}^{d}c_{i}(I-A^{+}A)b_{i}=0$, implying that $I-A^{+}A=0$, but this contradicts the non-triviality of $\ker(A)$. This proves that $k$ is well-defined, which in turn proves that $n((x_{j})_{j=1}^{m})\neq 0$. It remains to prove the orthogonality claim. Let $v\in\operatorname{span}(\{x_{j}\mid j\in[m]\})$. Then there exist coefficients $c\in\mathbb{R}^{m}$ such that $v=A^{T}c$. Therefore

\langle v,n((x_{j})_{j=1}^{m})\rangle=\langle A^{T}c,n((x_{j})_{j=1}^{m})\rangle=c^{T}A(I-A^{+}A)b_{k}=0,

where the last equality holds since $A(I-A^{+}A)=A-AA^{+}A=0$, i.e., $(I-A^{+}A)b_{k}\in\ker(A)$. ∎
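The construction of Lemma 8 is straightforward to transcribe numerically. Below is a minimal NumPy sketch (ours; the standard basis plays the role of $(b_{i})_{i=1}^{d}$, and the tolerance is an arbitrary numerical choice):

```python
import numpy as np

def normal_vector(xs):
    """Return a nonzero vector orthogonal to the span of the rows of xs
    (an m x d array with m < d), following the construction of Lemma 8."""
    m, d = xs.shape
    P = np.eye(d) - np.linalg.pinv(xs) @ xs  # orthogonal projector onto ker(A)
    for b in np.eye(d):                      # ordered basis b_1, ..., b_d
        v = P @ b
        if np.linalg.norm(v) > 1e-10:        # first k with (I - A^+A) b_k != 0
            return v
    raise ValueError("rows of xs already span R^d")  # impossible when m < d

rng = np.random.default_rng(3)
xs = rng.standard_normal((2, 4))             # m = 2 points in R^4
z = normal_vector(xs)
print(np.abs(xs @ z).max())                  # ~ 0: z is orthogonal to span(xs)
```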

Appendix D Missing proofs for Theorem 1

This section contains the proofs of Lemma 2 and Corollary 1 that we used in the proof of Theorem 1.

D.1 Proof of Lemma 2

We compute the expectation:

\displaystyle\operatorname{E}\left[\lVert\nabla R_{p,n}(w_{p}^{*})\rVert_{H_{p}^{-1}}^{2}\right]=\operatorname{E}\left[\left\lVert n^{-1}\sum_{i=1}^{n}\nabla\ell_{p}(\langle w_{p}^{*},X_{i}\rangle-Y_{i})\right\rVert_{H_{p}^{-1}}^{2}\right]
=n2i=1nE[p(wp,XiYi)Hp12]\displaystyle=n^{-2}\sum_{i=1}^{n}\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X_{i}\rangle-Y_{i})\rVert_{H_{p}^{-1}}^{2}\right]
+2n2i=1nj=1i1E[p(wp,XiYi)],E[p(wp,XjYj)]Hp1\displaystyle\quad+2n^{-2}\sum_{i=1}^{n}\sum_{j=1}^{i-1}\langle\operatorname{E}[\nabla\ell_{p}(\langle w_{p}^{*},X_{i}\rangle-Y_{i})],\operatorname{E}[\nabla\ell_{p}(\langle w_{p}^{*},X_{j}\rangle-Y_{j})]\rangle_{H_{p}^{-1}}
=n1E[p(wp,XY)Hp12]\displaystyle=n^{-1}\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]

where in the second line we expanded the inner product of the sums into its n2n^{2} terms, used linearity of expectation, and used the independence of the samples to take the expectation inside the inner product. In the last line, we used the fact that the samples are identically distributed to simplify the first term. For the second term, we used the fact that the expectation of the gradient of the loss at the risk minimizer vanishes. Applying Markov’s inequality finishes the proof. ∎
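The identity behind Lemma 2 (the cross terms vanish because the gradient has mean zero at $w_{p}^{*}$) is easy to verify by simulation. In the sketch below (our own setup; we use the identity weighting in place of $H_{p}^{-1}$, which does not affect the argument since the identity holds for any fixed weighting matrix, and $w_{p}^{*}=w^{*}$ because the noise is symmetric and independent of $X$):

```python
import numpy as np

rng = np.random.default_rng(4)
p, d, n = 3.0, 2, 50
w_star = np.array([1.0, -2.0])

def grad_losses(w, X, y):
    # per-sample gradients of ell_p(t) = |t|^p / (p(p-1)) at t = <w, X_i> - y_i
    r = X @ w - y
    return (np.abs(r) ** (p - 1) * np.sign(r) / (p - 1))[:, None] * X

def sample(m):
    X = rng.standard_normal((m, d))
    y = X @ w_star + rng.standard_normal(m)  # symmetric noise => w_p^* = w^*
    return X, y

# E || grad R_{p,n}(w^*) ||^2, averaged over many datasets ...
lhs = np.mean([np.sum(grad_losses(w_star, *sample(n)).mean(axis=0) ** 2)
               for _ in range(2_000)])
# ... versus n^{-1} E || grad ell_p ||^2, estimated from one large sample
X, y = sample(500_000)
rhs = np.mean(np.sum(grad_losses(w_star, X, y) ** 2, axis=1)) / n
print(lhs, rhs)  # the two estimates agree up to Monte Carlo error
```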

D.2 Proof of Corollary 1

We have

ww2H2,n2\displaystyle\lVert w-w_{2}^{*}\rVert_{H_{2,n}}^{2} =(ww2)TH2,n(ww2)\displaystyle=(w-w_{2}^{*})^{T}H_{2,n}(w-w_{2}^{*})
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}(w-w_{2}^{*})^{T}\nabla^{2}\ell_{2}(\langle w_{2}^{*},X_{i}\rangle-Y_{i})(w-w_{2}^{*})
=1ni=1nww2,Xi2.\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\langle w-w_{2}^{*},X_{i}\rangle^{2}.

Now by assumption, the components of the vector $X$ have finite fourth moments, so applying Proposition 1 and using the condition on $n$ yields the result. ∎

Appendix E Detailed proof of Theorem 4

This section contains the proof of Lemma 3 as well as that of Theorem 4.

E.1 Proof of Lemma 3

By Lemma 2.5 in [AKPS22], we have for all t,st,s\in\mathbb{R}

p(t)p(s)p(s)(ts)18(p1)p′′(s)(ts)2.\ell_{p}(t)-\ell_{p}(s)-\ell_{p}^{\prime}(s)(t-s)\geq\frac{1}{8(p-1)}\ell_{p}^{\prime\prime}(s)(t-s)^{2}.

Recall that by the chain rule

p(w,XY)\displaystyle\nabla\ell_{p}(\langle w,X\rangle-Y) =p(w,XY)X\displaystyle=\ell_{p}^{\prime}(\langle w,X\rangle-Y)X 2p(w,XY)=p′′(w,XY)XXT.\displaystyle\nabla^{2}\ell_{p}(\langle w,X\rangle-Y)=\ell_{p}^{\prime\prime}(\langle w,X\rangle-Y)XX^{T}.

Replacing tt and ss by w,XiYi\langle w,X_{i}\rangle-Y_{i} and wp,XiYi\langle w_{p}^{*},X_{i}\rangle-Y_{i} respectively, and using the formulas for the gradient and Hessian we arrive at

\ell_{p}(\langle w,X_{i}\rangle-Y_{i})-\ell_{p}(\langle w_{p}^{*},X_{i}\rangle-Y_{i})\geq\frac{1}{8(p-1)}(w-w_{p}^{*})^{T}\nabla^{2}\ell_{p}(\langle w_{p}^{*},X_{i}\rangle-Y_{i})(w-w_{p}^{*})+\langle\nabla\ell_{p}(\langle w_{p}^{*},X_{i}\rangle-Y_{i}),w-w_{p}^{*}\rangle.

Averaging over i[n]i\in[n] yields the first inequality. The proof of the second inequality proceeds in the same way and uses instead the upper bound of Lemma 2.5 in [AKPS22]. We omit it here. ∎
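Since the proof rests entirely on the quoted self-bounding property of $\ell_{p}$, a numerical spot-check is reassuring. A minimal sketch (ours; the sampling ranges are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

ell   = lambda t, p: np.abs(t) ** p / (p * (p - 1))
dell  = lambda t, p: np.abs(t) ** (p - 1) * np.sign(t) / (p - 1)
d2ell = lambda t, p: np.abs(t) ** (p - 2)

for _ in range(100_000):
    p = rng.uniform(2.0, 8.0)
    t, s = 3 * rng.standard_normal(2)
    # Bregman divergence of ell_p versus the quoted quadratic lower bound
    bregman = ell(t, p) - ell(s, p) - dell(s, p) * (t - s)
    lower = d2ell(s, p) * (t - s) ** 2 / (8 * (p - 1))
    assert bregman >= lower - 1e-8 * (1.0 + abs(bregman) + lower)
```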

E.2 Proof of Theorem 4

We proceed similarly to the proof of Theorem 1. By definition of the empirical risk minimizer, we have the upper bound

Rp,n(w^p)Rp,n(wp)0.R_{p,n}(\hat{w}_{p})-R_{p,n}(w_{p}^{*})\leq 0. (28)

Using (10) from Lemma 3 and the Cauchy-Schwarz inequality, we obtain the pointwise lower bound

Rp,n(w^p)Rp,n(wp)Rp,n(wp)Hp1w^pwpHp+18(p1)w^pwpHp,n2.R_{p,n}(\hat{w}_{p})-R_{p,n}(w_{p}^{*})\geq-\lVert\nabla R_{p,n}(w_{p}^{*})\rVert_{H_{p}^{-1}}\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}+\frac{1}{8(p-1)}\lVert\hat{w}_{p}-w_{p}^{*}\rVert^{2}_{H_{p,n}}. (29)

Using Lemma 2 we have that, with probability at least 1δ/21-\delta/2,

Rp,n(wp)Hp12E[p(wp,XY)Hp12]/(nδ).\lVert\nabla R_{p,n}(w_{p}^{*})\rVert_{H_{p}^{-1}}\leq\sqrt{2\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]/(n\delta)}. (30)

It remains to control w^pwpHp,n2\lVert\hat{w}_{p}-w_{p}^{*}\rVert^{2}_{H_{p,n}} from below. Define the random vector

Z=\lvert\langle w_{p}^{*},X\rangle-Y\rvert^{(p-2)/2}X.

Then, for any wdw\in\mathbb{R}^{d}, we have

wwpHp,n2\displaystyle\lVert w-w_{p}^{*}\rVert_{H_{p,n}}^{2} =(wwp)THp,n(wwp)\displaystyle=(w-w_{p}^{*})^{T}H_{p,n}(w-w_{p}^{*})
=1ni=1n(wwp)T2p(wp,XiYi)(wwp)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}(w-w_{p}^{*})^{T}\nabla^{2}\ell_{p}(\langle w_{p}^{*},X_{i}\rangle-Y_{i})(w-w_{p}^{*})
=1ni=1nwwp,Zi2\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\langle w-w_{p}^{*},Z_{i}\rangle^{2}

By assumption, the components of the random vector ZZ have finite fourth moment. Applying Proposition 1, and using the condition on nn assumed in the statement of Theorem 4, we get that, with probability at least 1δ/21-\delta/2, for all wdw\in\mathbb{R}^{d},

wwpHp,n212wwpHp2.\lVert w-w_{p}^{*}\rVert_{H_{p,n}}^{2}\geq\frac{1}{2}\lVert w-w_{p}^{*}\rVert_{H_{p}}^{2}. (31)

Combining (30) and (31) with (29) gives that with probability at least 1δ1-\delta,

Rp,n(w^p)Rp,n(wp)2E[p(wp,XY)Hp12]/(nδ)w^pwpHp+116(p1)w^pwpHp2.R_{p,n}(\hat{w}_{p})-R_{p,n}(w_{p}^{*})\geq-\sqrt{2\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]/(n\delta)}\,\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}\\ +\frac{1}{16(p-1)}\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}. (32)

Further combining (32) with (28) and rearranging yields that with probability at least 1δ1-\delta

w^pwpHp2512p2E[p(wp,XY)Hp12]nδ\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}\leq\frac{512p^{2}\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]}{n\delta} (33)

The last step is to bound the excess risk of the empirical risk minimizer using the bound (33) and (11) from Lemma 3. For that, we control the LpL^{p} norm term in (11) as follows

ppw^pwpLpp\displaystyle p^{p}\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{L^{p}}^{p} =(p2w^pwpLp2w^pwpHp2w^pwpHp2)p/2\displaystyle=\left\lparen p^{2}\frac{\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{L^{p}}^{2}}{\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}}\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}\right\rparen^{p/2}
(p2supwd{0}{wLp2wHp2}w^pwpHp2)p/2\displaystyle\leq\left\lparen p^{2}\sup_{w\in\mathbb{R}^{d}\setminus\{0\}}\left\{\frac{\lVert w\rVert_{L^{p}}^{2}}{\lVert w\rVert_{H_{p}}^{2}}\right\}\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}\right\rparen^{p/2}
=(p2cp2w^pwpHp2)p/2.\displaystyle=\left\lparen p^{2}c_{p}^{2}\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}\right\rparen^{p/2}. (34)

Combining (33), (11), and (34) yields the result. ∎
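The $1/n$ rate in (33) can be observed in simulation. The following sketch (our own; the design, the noise, and the optimizer are arbitrary choices, and $w_{p}^{*}=w^{*}$ here because the noise is symmetric and independent of $X$) computes the empirical risk minimizer with scipy and tracks the squared parameter error, which decays at the same $1/n$ rate as the $H_{p}$-norm error since $H_{p}$ is a fixed positive definite matrix:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
p, d = 4.0, 3
w_star = np.array([1.0, -1.0, 0.5])

def erm(n):
    X = rng.standard_normal((n, d))
    y = X @ w_star + rng.standard_normal(n)  # symmetric noise => w_p^* = w^*
    risk = lambda w: np.mean(np.abs(X @ w - y) ** p) / (p * (p - 1))
    return minimize(risk, np.zeros(d), method="BFGS").x

for n in (200, 800, 3200):
    err = np.mean([np.sum((erm(n) - w_star) ** 2) for _ in range(20)])
    print(n, err)  # roughly divides by 4 each time n is multiplied by 4
```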

Appendix F Detailed proof of Theorem 5

This section contains the proof of Lemma 4 as well as that of Theorem 5.

F.1 Proof of Lemma 4

Both inequalities follow from Lemma 4.5 in [AKPS19]. Inequality (12) follows from a straightforward calculation using the lower bound of Lemma 4.5 in [AKPS19]; we omit it here. The upper bound requires a bit more work. By the quoted lemma, we have

p(t)p(s)p(s)(ts)4p(p1)γp(|s|,|ts|).\ell_{p}(t)-\ell_{p}(s)-\ell_{p}^{\prime}(s)(t-s)\leq\frac{4}{p(p-1)}\gamma_{p}(\lvert s\rvert,\lvert t-s\rvert).

Now assume that |s|>0\lvert s\rvert>0. If |ts||s|\lvert t-s\rvert\leq\lvert s\rvert, we have

γp(|s|,|ts|)=p2|s|p2(ts)2|s|p2(ts)2=p′′(s)(ts)2.\gamma_{p}(\lvert s\rvert,\lvert t-s\rvert)=\frac{p}{2}\lvert s\rvert^{p-2}{(t-s)^{2}}\leq\lvert s\rvert^{p-2}(t-s)^{2}=\ell_{p}^{\prime\prime}(s)(t-s)^{2}.

Otherwise, if |ts|>|s|\lvert t-s\rvert>\lvert s\rvert, then we have

γp(|s|,|ts|)=|ts|p(1p/2)|s|p(ts)2|ts|p2|s|p2(ts)2=p′′(s)(ts)2.\gamma_{p}(\lvert s\rvert,\lvert t-s\rvert)=\lvert t-s\rvert^{p}-(1-p/2)\lvert s\rvert^{p}\leq(t-s)^{2}\lvert t-s\rvert^{p-2}\leq\lvert s\rvert^{p-2}(t-s)^{2}=\ell_{p}^{\prime\prime}(s)(t-s)^{2}.

Therefore in both cases we have $\gamma_{p}(\lvert s\rvert,\lvert t-s\rvert)\leq\ell_{p}^{\prime\prime}(s)(t-s)^{2}$ as long as $\lvert s\rvert>0$. Replacing $t$ and $s$ by $\langle w,X\rangle-Y$ and $\langle w_{p}^{*},X\rangle-Y$ respectively, we get, on the event that $\langle w_{p}^{*},X\rangle-Y\neq 0$,

\ell_{p}(\langle w,X\rangle-Y)-\ell_{p}(\langle w_{p}^{*},X\rangle-Y)-\langle\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y),w-w_{p}^{*}\rangle\leq\frac{4}{p(p-1)}\lVert w-w_{p}^{*}\rVert_{\nabla^{2}\ell_{p}(\langle w_{p}^{*},X\rangle-Y)}^{2}.

Recalling that by assumption P(wp,XY0)=1\operatorname{P}\left\lparen\langle w_{p}^{*},X\rangle-Y\neq 0\right\rparen=1, taking expectation of both sides, and bounding 1/p11/p\leq 1 finishes the proof of (13). ∎
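The case analysis above is easy to spot-check numerically. A short sketch (ours; $\gamma_{p}$ is implemented directly from the piecewise formulas used in the two cases):

```python
import numpy as np

rng = np.random.default_rng(7)

def gamma(a, b, p):
    # gamma_p(a, b) from [AKPS19]: quadratic for b <= a, p-th power beyond
    return p / 2 * a ** (p - 2) * b ** 2 if b <= a else b ** p - (1 - p / 2) * a ** p

for _ in range(100_000):
    p = rng.uniform(1.05, 1.95)
    s, t = rng.standard_normal(2)
    if abs(s) < 1e-3:
        continue                                # the bound is only claimed for s != 0
    lhs = gamma(abs(s), abs(t - s), p)
    rhs = abs(s) ** (p - 2) * (t - s) ** 2      # = ell_p''(s) (t - s)^2
    assert lhs <= rhs * (1 + 1e-9) + 1e-12
```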

F.2 Proof of Theorem 5

We follow the same proof strategy as the one used in the proofs of Theorems 1 and 4. By definition of the empirical risk minimizer, we have

Rp,n(w^p)Rp,n(wp)0.R_{p,n}(\hat{w}_{p})-R_{p,n}(w_{p}^{*})\leq 0. (35)

Using (12) from Lemma 4 and the Cauchy-Schwarz inequality, we have the lower bound

R_{p,n}(\hat{w}_{p})-R_{p,n}(w_{p}^{*})\geq-\lVert\nabla R_{p,n}(w_{p}^{*})\rVert_{H_{p}^{-1}}\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}+\frac{1}{4p^{2}}\frac{1}{n}\sum_{i=1}^{n}\gamma_{p}\left\lparen\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\lvert\langle\hat{w}_{p}-w_{p}^{*},X_{i}\rangle\rvert\right\rparen. (36)

Using Lemma 2 we have that, with probability at least 1δ/21-\delta/2,

Rp,n(wp)Hp12E[p(wp,XY)Hp12]/(nδ).\lVert\nabla R_{p,n}(w_{p}^{*})\rVert_{H_{p}^{-1}}\leq\sqrt{2\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]/(n\delta)}. (37)

On the other hand, by Proposition 2, we have, with probability at least $1-\delta/2$, for all $w\in\mathbb{R}^{d}$,

1ni=1nγp(|wp,XiYi|,|wwp,Xi|)p8min{wwpHp2,ε2pwwpHpp},\frac{1}{n}\sum_{i=1}^{n}\gamma_{p}\left\lparen\lvert\langle w_{p}^{*},X_{i}\rangle-Y_{i}\rvert,\lvert\langle w-w_{p}^{*},X_{i}\rangle\rvert\right\rparen\geq\frac{p}{8}\min\left\{\lVert w-w_{p}^{*}\rVert_{H_{p}}^{2},\varepsilon^{2-p}\lVert w-w_{p}^{*}\rVert_{H_{p}}^{p}\right\}, (38)

where ε\varepsilon is as defined in Proposition 2. We now consider two cases. If w^pwpHp2ε2pw^pwpHpp\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}\leq\varepsilon^{2-p}\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{p}, then combining (35), (36), (37), and (38) gives

w^pwpHp28192E[p(wp,XY)Hp12]nδ.\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}\leq\frac{8192\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]}{n\delta}. (39)

Otherwise, $\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}>\varepsilon^{2-p}\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{p}$, and again combining (35), (36), (37), and (38) gives

w^pwpHp2(8192E[p(wp,XY)Hp12]ε2(p2)nδ)1/(p1)\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}\leq\left\lparen\frac{8192\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]\varepsilon^{2(p-2)}}{n\delta}\right\rparen^{1/(p-1)} (40)

In either case, we have, using (39) and (40), with probability at least 1δ1-\delta,

w^pwpHp28192E[p(wp,XY)Hp12]nδ+(8192E[p(wp,XY)Hp12]ε2(p2)nδ)1/(p1).\lVert\hat{w}_{p}-w_{p}^{*}\rVert_{H_{p}}^{2}\leq\frac{8192\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]}{n\delta}\\ +\left\lparen\frac{8192\operatorname{E}\left[\lVert\nabla\ell_{p}(\langle w_{p}^{*},X\rangle-Y)\rVert_{H_{p}^{-1}}^{2}\right]\varepsilon^{2(p-2)}}{n\delta}\right\rparen^{1/(p-1)}.

Combining this last inequality with (13) from Lemma 4 finishes the proof. ∎