Optimal Excess Risk Bounds for Empirical Risk Minimization on $p$-Norm Linear Regression
Abstract
We study the performance of empirical risk minimization on the $p$-norm linear regression problem for $p \in (1, \infty)$. We show that, in the realizable case, under no moment assumptions, and up to a distribution-dependent constant, $O(d)$ samples are enough to exactly recover the target. Otherwise, for $p \in [2, \infty)$, and under weak moment assumptions on the target and the covariates, we prove a high probability excess risk bound on the empirical risk minimizer whose leading term matches, up to a constant that depends only on $p$, the asymptotically exact rate. We extend this result to the case $p \in (1, 2)$ under mild assumptions that guarantee the existence of the Hessian of the risk at its minimizer.
1 Introduction
Real-valued linear prediction is a fundamental problem in machine learning. Traditionally, the square loss has been the default choice for this problem. The performance of empirical risk minimization (ERM) on linear regression under the square loss, as measured by the excess risk, has been studied extensively both from an asymptotic [Whi82, LC83, Vaa98] and a non-asymptotic point of view [AC11, HKZ12, Oli16, LM16, Sau18, Mou22]. An achievement of the last decade has been the development of non-asymptotic excess risk bounds for ERM on this problem under weak assumptions, and which match, up to constant factors, the asymptotically exact rate.
In this paper, we consider the more general family of $p$-th power losses for a user-chosen $p \in (1, \infty)$. Under mild assumptions, the classical asymptotic theory can still be applied to ERM under these losses, yielding the asymptotic distribution of the excess risk. However, to the best of our knowledge, the problem of deriving non-asymptotic excess risk bounds for ERM for $p \neq 2$ remains open, and, as we discuss below, resists the application of standard tools from the literature.
Our motivation for extending the case $p = 2$ to $p \in (1, \infty)$ is twofold. Firstly, the freedom in the choice of $p$ allows us to better capture our prediction goals. For example, we might only care about how accurate our prediction is on average, in which case the choice $p = 1$ is appropriate. At the other extreme, we might insist that we do as well as possible on a subset of inputs of probability $1$, in which case the choice $p = \infty$ is best. A choice of $p \in (1, \infty)$ therefore allows us to interpolate between these two extremes, with the case $p = 2$ offering a balanced choice. Secondly, different choices of $p$ have complementary qualities. On the one hand, small values of $p$ allow us to operate with weak moment assumptions, making them applicable in more general cases. On the other, larger values of $p$ yield predictions whose optimality is less sensitive to changes in the underlying distribution: for $p = \infty$, the best predictor depends only on the support of this distribution.
To sharpen our discussion, let us briefly formalize our problem. There is an input random vector $X \in \mathbb{R}^d$ and an output random variable $Y \in \mathbb{R}$, and we are provided with $n$ i.i.d. samples $(X_1, Y_1), \dots, (X_n, Y_n)$ of $(X, Y)$. We select our set of predictors to be the class of linear functions $\{x \mapsto \langle w, x \rangle : w \in \mathbb{R}^d\}$, and choose a value $p \in (1, \infty)$ with the corresponding loss $\ell_p(t) := |t|^p$. Using this loss, we define the associated risk and empirical risk, respectively, by
$$R(w) := \mathbb{E}\bigl[\ell_p(\langle w, X \rangle - Y)\bigr], \qquad \hat{R}_n(w) := \frac{1}{n}\sum_{i=1}^{n} \ell_p(\langle w, X_i \rangle - Y_i).$$
We perform empirical risk minimization $\hat{w}_n \in \operatorname{argmin}_{w \in \mathbb{R}^d} \hat{R}_n(w)$, and our goal is to derive a high probability bound on the excess risk $R(\hat{w}_n) - R(w^*)$, where $w^* \in \operatorname{argmin}_{w \in \mathbb{R}^d} R(w)$ is the risk minimizer. For efficient algorithms for computing an empirical risk minimizer $\hat{w}_n$, we refer the reader to the rich recent literature dealing with this problem [BCLL18, AKPS19, APS19, JLS22].
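For illustration, here is a minimal Python sketch of the estimator we analyze, computed with a generic convex solver on synthetic data; the loss is taken to be $|\langle w, x \rangle - y|^p$, and the function name `erm_p_norm` and the toy data are ours rather than part of the formal setup.

```python
import numpy as np
from scipy.optimize import minimize

def erm_p_norm(X, y, p):
    """Empirical risk minimizer for the p-th power loss |<w, x> - y|^p.

    The empirical risk is convex in w for p > 1, so a quasi-Newton solver
    warm-started at the least-squares solution suffices for this sketch.
    """
    n, d = X.shape
    def risk(w):
        return np.mean(np.abs(X @ w - y) ** p)
    def grad(w):
        r = X @ w - y
        return (p / n) * ((np.sign(r) * np.abs(r) ** (p - 1)) @ X)
    w0 = np.linalg.lstsq(X, y, rcond=None)[0]   # p = 2 solution as warm start
    return minimize(risk, w0, jac=grad, method="L-BFGS-B").x

# Toy usage: heavy-tailed noise; p = 1.5 is less sensitive to outliers than p = 2.
rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + rng.standard_t(df=3, size=n)
w_hat = erm_p_norm(X, y, p=1.5)
print(np.linalg.norm(w_hat - w_star))
```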
To see why the problem we are considering is difficult, let us briefly review some of the recent literature. Most closely related to our problem are the results of [AC11, HKZ12, Oli16, LM16], who derive high probability non-asymptotic excess risk bounds for the case $p = 2$. The best such bounds are found in [Oli16] and [LM16], who both operate under weak assumptions on the distribution of $(X, Y)$, requiring at most the existence of fourth moments of $Y$ and of the components of $X$. Unfortunately, the analysis in [Oli16] relies on the closed form expression of the empirical risk minimizer $\hat{w}_n$, and therefore cannot be extended to other values of $p$. Similarly, the analysis in [LM16] relies on an exact decomposition of the excess loss in terms of “quadratic” and “multiplier” components, which also does not extend to other values of $p$.
To address these limitations, the work of [Men18] extends the ideas of [Men14, LM16] to work for loss functions more general than the square loss. Roughly speaking, the main result of [Men18] states that as long as the loss is strongly convex and smooth in a neighbourhood of the minimizer, the techniques developed by [Men14] can still be applied to obtain high probability excess risk bounds. Unfortunately, the loss functions $\ell_p$ are particularly ill-behaved in precisely this sense: their second derivative vanishes as its argument tends to $0$ for $p > 2$, and as it tends to infinity for $p \in (1, 2)$. This makes the analysis of the excess risk of ERM in the case $p \neq 2$ particularly challenging using well-established methods.
Contrary to the non-asymptotic regime, the asymptotic properties of the excess risk of ERM under the $p$-th power losses are better understood [Ron84, BRW92, Nie92, Arc96, HS96, LL05], and can be derived from the more general classical asymptotic theory of $M$-estimators [LC83, VW96, Vaa98] under mild regularity conditions. In particular, these asymptotic results imply that the excess risk of ERM with $n$ samples satisfies
(1) |
where $H$ is the Hessian of the risk at its minimizer. We refer the reader to the discussions in [OB21, MG22] for more details. As we demonstrate in Theorem 1, the rate of convergence of ERM for the square loss derived in [Oli16] and [LM16] matches the asymptotic rate (1) up to a constant factor. Ideally, we would like our excess risk bounds for the cases $p \neq 2$ to also match the asymptotic rate (1), although it is not yet clear how to derive any meaningful such bounds.
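For readers less familiar with the classical theory, the following display sketches the standard expansion that (1) refers to, written in our own notation: $H := \nabla^2 R(w^*)$, and $g(X, Y)$ denotes the gradient of the loss at the minimizer $w^*$ evaluated at the pair $(X, Y)$; see [Vaa98, Chapter 5] for the precise regularity conditions.

```latex
% Classical M-estimation asymptotics (sketch, under regularity conditions):
\sqrt{n}\,\bigl(\hat{w}_n - w^*\bigr)
  \xrightarrow{\;d\;} \mathcal{N}\!\bigl(0,\; H^{-1}\operatorname{Cov}\bigl(g(X,Y)\bigr)H^{-1}\bigr),
\qquad
n\,\bigl(R(\hat{w}_n) - R(w^*)\bigr)
  \xrightarrow{\;d\;} \tfrac{1}{2}\,\|G\|_{H^{-1}}^{2},
\quad G \sim \mathcal{N}\!\bigl(0,\; \operatorname{Cov}\bigl(g(X,Y)\bigr)\bigr),
% so that the excess risk concentrates at scale
% \operatorname{tr}\bigl(H^{-1}\operatorname{Cov}(g(X,Y))\bigr) / (2n).
```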
In this paper, we prove the first high probability non-asymptotic excess risk bounds for ERM under the $p$-th power losses for any $p \in (1, \infty)$. Our assumptions on the distribution of $(X, Y)$ are weak, arise naturally from the analysis, and reduce to the standard ones for the case $p = 2$. Furthermore, the rate we derive matches, up to a constant that depends only on $p$, the asymptotically exact rate (1).
We split the analysis in three cases. The first is when the problem is realizable, i.e. $Y = \langle w^*, X \rangle$ almost surely for some $w^* \in \mathbb{R}^d$. This edge case is not problematic for the analysis of the case $p = 2$, but as discussed above, the losses $\ell_p$ are ill-behaved around $0$ for $p \neq 2$, requiring us to treat this case separately. The second case is when the problem is not realizable and $p \in [2, \infty)$. The final case is when the problem is not realizable and $p \in (1, 2)$, which turns out to be the most technically challenging. In Section 2, we present our main results and in Section 3, we provide their proofs.
Notation. We denote the components of the random vector $X$ by $X^j$ for $j \in \{1, \dots, d\}$. We assume the support of $X$ is not contained in any hyperplane, i.e. $\langle w, X \rangle = 0$ almost surely only if $w = 0$. This is without loss of generality as discussed in [Oli16, Mou22]. For a positive semi-definite matrix $M$, we denote the bilinear form it induces on $\mathbb{R}^d$ by $\langle u, v \rangle_M := \langle u, M v \rangle$, and define $\|u\|_M := \sqrt{\langle u, u \rangle_M}$. We define .
2 Main results
In this section, we state our main results. We start in Section 2.1, where we introduce constants that help us formulate our theorems. In Section 2.2, we state the best known results for both the case $p = 2$ and the realizable case where $Y = \langle w^*, X \rangle$ almost surely. Finally, in Section 2.3, we state our theorems.
2.1 Norm equivalence and small ball constants
To state our results, we will need to define two types of quantities first. The first kind are related to norms and their equivalence constants, which we will use in the analysis of the non-realizable case. The second are small ball probabilities, which we will use in the analysis of the realizable case.
We start by introducing the following functions on our space of coefficients . For , define, with the convention for all ,
(2) |
As suggested by the notation, under appropriate moment assumptions on , these functions are indeed norms on . In that case, we will be interested in norm equivalence constants between them
(3) |
where and stand for one of or . Let us note that since we work in a finite dimensional vector space, all norms are equivalent, so that as soon as the quantities defined in (2) are indeed norms, the constants defined in (3) are finite. Furthermore, as suggested by the notation, may be viewed as the maximum second moment of the random variables over the unit sphere of . Finally, we record the following identities for future use
(4) |
The first identity holds by linearity, and the second by noticing that .
We now turn to small ball probabilities. We define the following functions on , for ,
(5) |
Assumptions on the functions and have been used extensively in the recent literature, see e.g. [Men14, KM15, LM17, LM17a, Men18, LM18, Mou22]. In particular, a standard assumption postulates the existence of strictly positive constants , and such that and for all . Conditions of this type are usually referred to as small ball conditions. Efforts have been made to understand when these conditions hold [Men14, RV15, LM17a] as well as reveal the dimension dependence of the constants with which they do [Sau18]. Here we prove that such conditions always hold for finite dimensional spaces. We leave the proof of Lemma 1 to Appendix B to not distract from our main development.
Lemma 1.
is upper semi-continuous. Furthermore, if for some , for all , then is lower semi-continuous for any . Moreover, for all
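To make these quantities concrete, the following Monte Carlo sketch estimates a small-ball function of the standard form $\inf_{\|w\|_2 = 1} \mathbb{P}\bigl(|\langle w, X \rangle| \geq \beta\bigr)$ by restricting the infimum to a finite random net of directions; this particular form, and the names below, are illustration choices of ours rather than the definitions in (5).

```python
import numpy as np

def small_ball_estimate(sample_X, beta, n_directions=2000, seed=0):
    """Monte Carlo sketch of inf_{||w||_2 = 1} P(|<w, X>| >= beta),
    approximating the infimum over the sphere by a finite random net."""
    rng = np.random.default_rng(seed)
    d = sample_X.shape[1]
    W = rng.standard_normal((n_directions, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit directions
    margins = np.abs(sample_X @ W.T)                # |<w, X_i>| per direction
    return (margins >= beta).mean(axis=0).min()     # worst direction on the net

# Toy usage: for standard Gaussian X the true value is 2 * P(N(0, 1) >= beta).
rng = np.random.default_rng(1)
X = rng.standard_normal((20000, 3))
print(small_ball_estimate(X, beta=0.5))
```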
2.2 Background
To better contextualize our results, we start by stating the best known high probability bound on ERM for the square loss, which we deduce from [Oli16] and [LM16].
Theorem 1 (Theorem 4.2, [Oli16]; Theorem 1.3, [LM16]).
Assume that and for all , and let . If
then, with probability at least
Up to a constant factor and the dependence on the confidence parameter, Theorem 1 recovers the asymptotically exact rate (1). Let us briefly comment on the differences between Theorem 1 and the comparable statements in the original papers. First, the finiteness of the norm equivalence constant is deduced from the finiteness of the fourth moments of the components of $X$, instead of being assumed as in [Oli16] (see the discussion in Section 3.1 in [Oli16]). Second, we combine Theorem 3.1 from [Oli16] with the proof technique of [LM16] to achieve a slightly better bound than the one achieved by the proof technique used in the proof of Theorem 4.2 in [Oli16], while avoiding the dependence on the small-ball constant present in the bound of Theorem 1.3 in [LM16], which is known to incur additional dimension dependence in some cases [Sau18].
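For intuition, the following small simulation (ours; it assumes a well-specified Gaussian model, which is stronger than the assumptions of Theorem 1) illustrates the $d/n$ scaling of the excess risk of least squares.

```python
import numpy as np

def mean_excess_risk(n, d, sigma=1.0, trials=200, seed=0):
    """Average excess risk of OLS under y = <w*, x> + sigma * N(0, 1), x ~ N(0, I_d).

    For the square loss, R(w) - R(w*) = ||w - w*||_Sigma^2 with Sigma = I_d here,
    and classical theory predicts a mean of roughly sigma^2 * d / n."""
    rng = np.random.default_rng(seed)
    excess = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        w_star = rng.standard_normal(d)
        y = X @ w_star + sigma * rng.standard_normal(n)
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        excess.append(np.sum((w_hat - w_star) ** 2))
    return np.mean(excess)

for n in (200, 400, 800):
    print(n, mean_excess_risk(n, d=10), 10 / n)   # empirical vs. sigma^2 * d / n
```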
We now move to the realizable case, where so that for all . We immediately note that Theorem 1 is still applicable in this case, and ensures that we recover exactly with no more than samples. However, we can do much better, while getting rid of all the moment assumptions in Theorem 1. Indeed, it is not hard to see that only if for some , for all (taking works). The implicit argument in Theorem 1 then uses the pointwise bound (see Lemma B.2 in [Oli16])
and uniformizes it over the unit sphere in , where the norm is as defined in (2). However, we can use the much tighter bound where is as defined in Lemma 1. To the best of our knowledge, the realizable case has not been studied explicitly before in the literature. However, with the above considerations in mind, we can deduce the following result from [LM17a], which uniformizes the pointwise bound we just discussed using a VC dimension argument.
Theorem 2 (Corollary 2.5, [LM17a]).
Assume that there exists such that . Let . If
then for any , with probability at least .
2.3 Results
We are now in a position to state our main results. As discussed in Section 1, the second derivatives of the losses $\ell_p$ are ill-behaved at $0$. When the problem is realizable, the risk is therefore not twice differentiable at its minimizer for the cases $p \in (1, 2)$, and its Hessian at the minimizer is degenerate for the cases $p \in (2, \infty)$. If we want bounds of the form (1), we must exclude this case from our analysis. Our first main result is a strengthening of Theorem 2, and relies on a combinatorial argument to uniformize the pointwise estimate discussed in Section 2.2.
Theorem 3.
Assume that there exists such that . Then for all , and for all , we have
where is as defined in Lemma 1. Furthermore, if
then with probability at least , .
Comparing Theorem 2 and Theorem 3, we see that the bound on the number of samples required to reach a confidence level in Theorem 3 is uniformly smaller than the one in Theorem 2. The proof of Theorem 3 can be found in Appendix C.
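For illustration, the following toy experiment (ours) uses a Cauchy design, which has no finite moments, to emphasize that exact recovery in the realizable case is a rank event rather than a moment phenomenon: once the design matrix has full column rank, the only vector with zero empirical risk is the target, regardless of $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
w_star = rng.standard_normal(d)

for n in (d, 2 * d, 4 * d):
    X = rng.standard_cauchy((n, d))   # heavy tails: no moments exist
    y = X @ w_star                    # realizable: Y = <w*, X> exactly
    if np.linalg.matrix_rank(X) == d:
        # Any empirical risk minimizer has zero empirical risk, hence solves
        # X w = y, and full column rank forces w = w*; least squares finds it.
        w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        print(n, np.allclose(w_hat, w_star))
    else:
        print(n, "rank-deficient design: exact recovery not guaranteed")
```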
We now move to the more common non-realizable case. Our first theorem here gives a non-asymptotic bound for the excess risk of ERM under a $p$-th power loss for $p \in [2, \infty)$. To the best of our knowledge, no such result is known in the literature.
Theorem 4.
Let and . Assume that no satisfies . Further, assume that , , and for all . If
then with probability at least
where we used to denote as defined in (3).
Up to a constant factor that depends only on $p$ and the dependence on the confidence parameter, the bound of Theorem 4 is precisely of the form of the optimal bound (1). Indeed, as $n \to \infty$, the second term is $o(1/n)$. At the level of assumptions, the finiteness of the $p$-th moment of $Y$ and of the components of $X$ is necessary to ensure that the risk is finite for all $w$. The last assumption is a natural extension of the fourth moment assumption in Theorem 1. In fact, all three assumptions in Theorem 4 reduce to those of Theorem 1 as $p \to 2$. It is worth noting that the constant has the alternative expression given by (4), i.e. it is the norm equivalence constant between the norm and the norm induced by . Using again (4), we can see how this constant behaves as $p \to 2$. As $p$ grows, so does this constant, and we suspect in a dimension dependent way. However, this does not affect the asymptotic optimality of our rate, as the constant only enters an $o(1/n)$ term in our bound.
We now turn to the case of $p \in (1, 2)$, where we need a slightly stronger version of non-realizability.
Theorem 5.
Let and . Assume that and . Further, assume that , , for all . If
then, with probability at least
where we used to denote and where .
Just like the bounds of Theorems 1 and 4, the bound of Theorem 5 is asymptotically optimal up to a constant factor that depends only on $p$. Indeed, as $n \to \infty$, the second term is $o(1/n)$. At the level of assumptions, we have two additional conditions compared to Theorem 4. First, we require the existence of the second moment of the covariates instead of just the $p$-th moment. Second, we require a stronger version of non-realizability by assuming the existence of a negative moment of the residual. In the majority of applications, an intercept variable is included as a covariate, so that this negative moment assumption is already implied by the standard non-realizability assumption. In the rare case where an intercept variable is not included, a negative moment assumption of the same kind can be used instead, at the cost of a larger constant factor in the $o(1/n)$ term.
Finally, it is worth noting that for the cases $p \in (1, 2)$, there are situations where the asymptotic bound (1) does not hold, as the limiting distribution of the coefficients as $n \to \infty$ is not necessarily Gaussian, and depends heavily on the distribution of the noise, see e.g. [LL05] and [Kni98]. Overall, we suspect that perhaps a slightly weaker version of our assumptions is necessary for a fast rate like (1) to hold.
3 Proofs
3.1 Proof of Theorem 1
Here we give a detailed proof of Theorem 1. While the core technical result can be deduced by combining results from [Oli16] and [LM16], here we frame the proof in a way that makes it easy to extend to the cases $p \neq 2$, and differently from either paper. We split the proof into three steps. First notice that since the loss is a quadratic function of $w$, we can express it exactly using a second order Taylor expansion around the minimizer
Taking empirical averages and expectations of both sides respectively shows that the excess empirical risk and excess risk also admit such an expansion
(6) |
where in the second equality we used that the gradient of the risk vanishes at the minimizer . Therefore, to bound the excess risk, it is sufficient to bound the norm . This is the goal of the second step, where we use two ideas. First, by definition, the excess empirical risk of the empirical risk minimizer satisfies the upper bound
(7) |
Second, we use the Cauchy-Schwarz inequality to lower bound the excess empirical risk by
(8) |
and we further lower bound it by deriving high probability bounds on the two random terms appearing in (8). The first can easily be bounded using Chebyshev’s inequality and the elementary fact that the variance of the average of $n$ i.i.d. random variables is the variance of their common distribution divided by $n$. Here we state the result for all $p \in (1, \infty)$; the straightforward proof can be found in Appendix D.
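Before stating it, let us record the steps so far in our notation, with $H := \nabla^2 R(w^*)$ and $\hat{H}_n := \nabla^2 \hat{R}_n(w^*)$ (a sketch of (6)–(8); the exactness of the expansions uses only that the loss is quadratic in $w$):

```latex
R(w) - R(w^*) \;=\; \tfrac{1}{2}\,\|w - w^*\|_{H}^{2},
\qquad
\hat{R}_n(w) - \hat{R}_n(w^*) \;=\; \langle \nabla \hat{R}_n(w^*),\, w - w^*\rangle
  + \tfrac{1}{2}\,\|w - w^*\|_{\hat{H}_n}^{2},
\\[4pt]
\hat{R}_n(\hat{w}_n) - \hat{R}_n(w^*) \;\le\; 0,
\qquad
\hat{R}_n(\hat{w}_n) - \hat{R}_n(w^*)
  \;\ge\; -\,\|\nabla \hat{R}_n(w^*)\|_{H^{-1}}\,\|\hat{w}_n - w^*\|_{H}
  + \tfrac{1}{2}\,\|\hat{w}_n - w^*\|_{\hat{H}_n}^{2}.
```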
Lemma 2.
Let . If , let the assumptions of Theorem 5 hold. Then with probability at least
For the second random term , we use Theorem 3.1 of [Oli16], which we restate here, emphasizing that the existence of fourth moments of the components of the random vector is enough to ensure the existence of the needed norm equivalence constant.
Proposition 1 (Theorem 3.1, [Oli16]).
Let be a random vector satisfying for all and assume that only if . For and , define
Let be i.i.d. samples of . Then, with probability at least , for all ,
Using this result we can immediately deduce the required high probability lower bound on the second random term ; we leave the obvious proof to Appendix D.
Corollary 1.
Under the assumptions of Theorem 1, if , then with probability at least , for all ,
3.2 Proof Sketch of Theorem 4
The main challenge in moving from the case $p = 2$ to the case $p \in (2, \infty)$ is that the second order Taylor expansion of the loss is no longer exact. The standard way to deal with this problem is to assume that the loss is upper and lower bounded by quadratic functions, i.e. that it is smooth and strongly convex. Unfortunately, as discussed in Section 1, the loss $\ell_p$ is not strongly convex for any $p > 2$, so we need to find another way to deal with this issue. Once this has been resolved, however, the strategy we used in the proof of Theorem 1 can be applied almost verbatim to yield the result. Remarkably, a result of [AKPS22] allows us to upper and lower bound the $p$-th power loss for $p \geq 2$ by its second order Taylor expansion around a point, up to some residual terms. An application of this result yields the following Lemma.
Lemma 3.
Let . Then:
(10)
(11)
Up to constant factors that depend only on $p$ and a residual term, Lemma 3 gives matching upper and lower bounds on the excess risk and excess empirical risk in terms of their second order Taylor expansions around the minimizer. We can thus use the approach taken in the proof of Theorem 1 to obtain the result. The only additional challenge is the control of the residual term, which we achieve using norm equivalence. A detailed proof of Theorem 4, including the proof of Lemma 3, can be found in Appendix E.
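The shape of these bounds can be probed numerically. The sketch below (ours; it does not reproduce the exact constants of Lemma 2.5 in [AKPS22]) estimates, over a grid, the ratio between the linearization gap of $|\cdot|^p$ and the reference quantity $|x|^{p-2}\delta^2 + |\delta|^p$ for a few values of $p \geq 2$, which is the kind of two-sided comparison Lemma 3 relies on.

```python
import numpy as np

def linearization_gap(x, delta, p):
    """|x + delta|^p minus its first-order Taylor expansion at x."""
    return (np.abs(x + delta) ** p
            - np.abs(x) ** p
            - p * np.sign(x) * np.abs(x) ** (p - 1) * delta)

def sandwich_ratios(p, grid_size=400):
    """Empirical range of gap / (|x|^{p-2} delta^2 + |delta|^p) over a grid.

    For p >= 2 the ratio should stay within constants depending only on p."""
    xs = np.linspace(-5.0, 5.0, grid_size)
    ds = np.linspace(-5.0, 5.0, grid_size)
    X, D = np.meshgrid(xs[xs != 0], ds[ds != 0])   # avoid the trivial 0/0 point
    ref = np.abs(X) ** (p - 2) * D ** 2 + np.abs(D) ** p
    ratio = linearization_gap(X, D, p) / ref
    return ratio.min(), ratio.max()

for p in (2.0, 3.0, 4.0):
    print(p, sandwich_ratios(p))   # for p = 2 the ratio is identically 1/2
```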
3.3 Proof Sketch of Theorem 5
The most technically challenging case is when $p \in (1, 2)$. Indeed, as seen in the proof of Theorem 1, the most involved step is lower bounding the excess empirical risk with high probability. For the case $p \geq 2$, we achieved this by having access to a pointwise quadratic lower bound, which is not too surprising. Indeed, at small scales, we expect the second order Taylor expansion to be accurate, while at large scales, we expect the $p$-th power loss to grow at least quadratically for $p \geq 2$.
In the case of $p \in (1, 2)$, we are faced with a harder problem. Indeed, as $p \to 1$, the losses behave almost linearly at large scales. This means that we cannot expect to obtain a global quadratic lower bound as for the case $p \geq 2$, so we will need a different proof technique. Motivated by related concerns, [BCLL18] introduced the following approximation to the $p$-th power function
for and with . This function was further studied by [AKPS19], who showed in particular that, for any choice of its arguments, it is, up to constants that depend only on $p$, equal to the gap between the $p$-th power function and its linearization; see Lemma 4.5 in [AKPS19] for the precise statement. We use this result to derive the following Lemma.
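Before stating it, for concreteness, the sketch below implements one common form of this approximation as it appears in the iterative refinement literature (our transcription; the normalization in [BCLL18, AKPS19] may differ by factors depending only on $p$), and compares it numerically to the linearization gap of $|\cdot|^p$ for a value of $p \in (1, 2)$.

```python
import numpy as np

def gamma_p(x, t, p):
    """One common form of the smoothed p-th power approximation:
    gamma_p(x; t) = (p/2) * t**(p - 2) * x**2    if |x| <= t,
                    |x|**p + (p/2 - 1) * t**p    if |x| >  t,
    which is continuous at |x| = t."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    quad = 0.5 * p * t ** (p - 2) * x ** 2
    tail = np.abs(x) ** p + (0.5 * p - 1.0) * t ** p
    return np.where(np.abs(x) <= t, quad, tail)

def linearization_gap(s, x, p):
    """|s + x|^p minus its first-order Taylor expansion at s."""
    return (np.abs(s + x) ** p
            - np.abs(s) ** p
            - p * np.sign(s) * np.abs(s) ** (p - 1) * x)

# The gap should be within p-dependent constant factors of gamma_p(x; |s|),
# which is the content of Lemma 4.5 in [AKPS19]; we estimate that range here.
p = 1.5
s = np.linspace(0.1, 5.0, 200)          # linearization points (taken positive)
x = np.linspace(-5.0, 5.0, 200)
x = x[np.abs(x) > 1e-9]                 # avoid the trivial 0/0 case
S, Xg = np.meshgrid(s, x)
ratio = linearization_gap(S, Xg, p) / gamma_p(Xg, S, p)
print(ratio.min(), ratio.max())
```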
Lemma 4.
Let . Under the assumptions of Theorem 5, we have
(12)
(13)
As expected, while we do have the desired quadratic upper bound, the lower bound is much more cumbersome, and is only comparable to the second order Taylor expansion when . What we need for the proof to go through is a high probability lower bound of order on the first term in the lower bound (12). We obtain this in the following Proposition.
Proposition 2.
Proof.
Let and let be a truncation parameter we will set later. Define
and the constant . By Lemma 3.3 in [AKPS19], we have that for all . Furthermore, it is straightforward to verify that is decreasing in and increasing in . Therefore, we have, for all ,
(14) |
The key idea to control the infimum in (14) is to truncate from above by using the truncated vector , and from below by forcing it to be greater than . By the monotonicity properties of discussed above, we get that
(15) |
where the equality follows by the fact that with the chosen truncations, the second argument of is less than or equal to the first. It remains to lower bound the infimum in (15). Define
Removing the truncations and using our assumptions, we see that the components of have finite fourth moment. By Proposition 1 and the condition on , we get that with probability at least ,
(16) |
We now bound the supremum in (16). We have
(17) |
where the first inequality follows from the Cauchy-Schwarz inequality, and the subsequent equalities by the definitions of and . It remains to bound the tail probabilities. Recall that , so
where we applied Markov’s inequality in the fourth line, and the last follows by definition of in Theorem 5. Moreover by the finiteness of the second moment of the coordinates of , we have
where , and the last equality by definition of in Theorem 5. By Markov’s inequality
Choosing
ensures that
(18) |
Combining the inequalities (18), (17), (16), (15), and (14) yields the result. ∎
Acknowledgments and Disclosure of Funding
We thank Nikita Zhivotovskiy, Sushant Sachdeva, and Deeksha Adil for feedback on the manuscript. MAE was partially supported by NSERC Grant [2019-06167], CIFAR AI Chairs program, and CIFAR AI Catalyst grant.
References
- [AC11] Jean-Yves Audibert and Olivier Catoni “Robust Linear Least Squares Regression” In The Annals of Statistics, 2011 DOI: 10.1214/11-AOS918
- [AKPS19] Deeksha Adil, Rasmus Kyng, Richard Peng and Sushant Sachdeva “Iterative Refinement for $\ell_p$-norm Regression” In Proceedings of the 2019 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019 DOI: 10.1137/1.9781611975482.86
- [AKPS22] Deeksha Adil, Rasmus Kyng, Richard Peng and Sushant Sachdeva “Fast Algorithms for $\ell_p$-Regression”, 2022 DOI: 10.48550/arXiv.2211.03963
- [APS19] Deeksha Adil, Richard Peng and Sushant Sachdeva “Fast, Provably Convergent IRLS Algorithm for p-norm Linear Regression” In Advances in Neural Information Processing Systems, 2019 URL: https://proceedings.neurips.cc/paper_files/paper/2019/hash/46c7cb50b373877fb2f8d5c4517bb969-Abstract.html
- [Arc96] Miguel A. Arcones “The Bahadur-Kiefer Representation of $L_p$ Regression Estimators” In Econometric Theory, 1996 URL: https://www.jstor.org/stable/3532831
- [AS19] Deeksha Adil and Sushant Sachdeva “Faster $p$-Norm Minimizing Flows, via Smoothed $q$-Norm Problems” In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2019 DOI: 10.1137/1.9781611975994.54
- [BCLL18] Sébastien Bubeck, Michael B. Cohen, Yin Tat Lee and Yuanzhi Li “An Homotopy Method for $\ell_p$ Regression Provably beyond Self-Concordance and in Input-Sparsity Time” In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, 2018 DOI: 10.1145/3188745.3188776
- [BRW92] Z. D. Bai, C. R. Rao and Y. Wu “M-Estimation of Multivariate Linear Regression Parameters Under a Convex Discrepancy Function” In Statistica Sinica, 1992 URL: https://www.jstor.org/stable/24304129
- [CG17] Olivier Catoni and Ilaria Giulini “Dimension-Free PAC-Bayesian Bounds for Matrices, Vectors, and Linear Least Squares Regression”, 2017 DOI: 10.48550/arXiv.1712.02747
- [HKZ12] Daniel Hsu, Sham M. Kakade and Tong Zhang “Random Design Analysis of Ridge Regression” In Proceedings of the 25th Annual Conference on Learning Theory, 2012 URL: https://proceedings.mlr.press/v23/hsu12.html
- [HS96] Xuming He and Qi-Man Shao “A General Bahadur Representation of M-estimators and Its Application to Linear Regression with Nonstochastic Designs” In The Annals of Statistics, 1996 DOI: 10.1214/aos/1032181172
- [JLS22] Arun Jambulapati, Yang P. Liu and Aaron Sidford “Improved Iteration Complexities for Overconstrained $p$-Norm Regression” In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, 2022 DOI: 10.1145/3519935.3519971
- [KM15] Vladimir Koltchinskii and Shahar Mendelson “Bounding the Smallest Singular Value of a Random Matrix Without Concentration” In International Mathematics Research Notices, 2015 DOI: 10.1093/imrn/rnv096
- [Kni98] Keith Knight “Limiting Distributions for Regression Estimators under General Conditions” In The Annals of Statistics, 1998 DOI: 10.1214/aos/1028144858
- [LC83] Erich L. Lehmann and George Casella “Theory of Point Estimation”, 1983
- [LL05] P. Y. Lai and Stephen M. S. Lee “An Overview of Asymptotic Properties of $L_p$ Regression under General Classes of Error Distributions” In Journal of the American Statistical Association, 2005 URL: https://www.jstor.org/stable/27590567
- [LM16] Guillaume Lecué and Shahar Mendelson “Performance of Empirical Risk Minimization in Linear Aggregation” In Bernoulli, 2016 DOI: 10.3150/15-BEJ701
- [LM17] Guillaume Lecué and Shahar Mendelson “Regularization and the Small-Ball Method II: Complexity Dependent Error Rates” In Journal of Machine Learning Research, 2017 URL: http://jmlr.org/papers/v18/16-422.html
- [LM17a] Guillaume Lecué and Shahar Mendelson “Sparse Recovery under Weak Moment Assumptions” In Journal of the European Mathematical Society, 2017 DOI: 10.4171/jems/682
- [LM18] Guillaume Lecué and Shahar Mendelson “Regularization and the Small-Ball Method I: Sparse Recovery” In The Annals of Statistics, 2018 DOI: 10.1214/17-AOS1562
- [Men14] Shahar Mendelson “Learning without Concentration” In Proceedings of The 27th Conference on Learning Theory, 2014 URL: https://proceedings.mlr.press/v35/mendelson14.html
- [Men18] Shahar Mendelson “Learning without Concentration for General Loss Functions” In Probability Theory and Related Fields, 2018 DOI: 10.1007/s00440-017-0784-y
- [Men21] Shahar Mendelson “Learning Bounded Subsets of $L_p$” In IEEE Transactions on Information Theory, 2021 DOI: 10.1109/TIT.2021.3083553
- [MG22] Jaouad Mourtada and Stéphane Gaïffas “An Improper Estimator with Optimal Excess Risk in Misspecified Density Estimation and Logistic Regression” In Journal of Machine Learning Research, 2022 URL: http://jmlr.org/papers/v23/20-782.html
- [Mou22] Jaouad Mourtada “Exact Minimax Risk for Linear Least Squares, and the Lower Tail of Sample Covariance Matrices” In The Annals of Statistics, 2022 DOI: 10.1214/22-AOS2181
- [MVZ22] Jaouad Mourtada, Tomas Vaškevičius and Nikita Zhivotovskiy “Distribution-free robust linear regression” In Mathematical Statistics and Learning, 2022 DOI: 10.4171/MSL/27
- [Nie92] Wojciech Niemiro “Asymptotics for $M$-Estimators Defined by Convex Minimization” In The Annals of Statistics, 1992 DOI: 10.1214/aos/1176348782
- [OB21] Dmitrii M. Ostrovskii and Francis Bach “Finite-Sample Analysis of $M$-Estimators Using Self-Concordance” In Electronic Journal of Statistics, 2021 DOI: 10.1214/20-EJS1780
- [Oli16] Roberto Imbuzeiro Oliveira “The Lower Tail of Random Quadratic Forms with Applications to Ordinary Least Squares” In Probability Theory and Related Fields, 2016 DOI: 10.1007/s00440-016-0738-9
- [Ron84] Arjen E. Ronner “Asymptotic Normality of $p$-norm Estimators in Multiple Regression” In Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1984 DOI: 10.1007/BF00531893
- [RV15] Mark Rudelson and Roman Vershynin “Small Ball Probabilities for Linear Images of High-Dimensional Distributions” In International Mathematics Research Notices, 2015 DOI: 10.1093/imrn/rnu243
- [Sau18] Adrien Saumard “On Optimality of Empirical Risk Minimization in Linear Aggregation” In Bernoulli, 2018 DOI: 10.3150/17-BEJ925
- [Vaa98] A. W. van der Vaart “Asymptotic Statistics”, 1998 DOI: 10.1017/CBO9780511802256
- [VW96] Aad W. Van Der Vaart and Jon A. Wellner “Weak Convergence and Empirical Processes”, 1996 DOI: 10.1007/978-1-4757-2545-2
- [VZ23] Tomas Vaškevičius and Nikita Zhivotovskiy “Suboptimality of Constrained Least Squares and Improvements via Non-Linear Predictors” In Bernoulli, 2023 DOI: 10.3150/22-BEJ1465
- [Whi82] Halbert White “Maximum Likelihood Estimation of Misspecified Models” In Econometrica, 1982 DOI: 10.2307/1912526
Appendix A Differentiability of the risk
In this section, we rigorously establish the twice differentiability of the risk under our assumptions. We start by showing that under a subset of our assumptions, the risk is differentiable everywhere on $\mathbb{R}^d$.
Lemma 5.
Let and assume that and for all . Then is differentiable on , and
Proof.
Let . We want to show that
where, for convenience, we take the norm to be the Euclidean norm. Define the function and note that by the chain rule is differentiable as a function of on all of . Now let be a sequence in such that . Then
(19) |
Our goal is to interchange the limit and expectation. For that, we will use the dominated convergence theorem. We construct our dominating function as follows. Let , and note that since as . Then we have
where the second line follows by the triangle inequality, the third from the fundamental theorem of calculus applied component-wise, the fourth by the Cauchy-Schwarz inequality, the fifth by Jensen’s inequality and the convexity of the norm, and the eighth by the inequality valid for . It remains to show that is integrable. We have
where in the second line we used that the Euclidean norm is bounded by the -norm, in the third we used Hölder’s inequality, and the last line follows from our assumptions. Applying the dominated convergence theorem, we interchange the limit and the expectation in (19). Recalling that is differentiable finishes the proof. ∎
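As a quick sanity check of this formula (ours; it assumes the unnormalized loss $|\langle w, X \rangle - Y|^p$, so the gradient carries a factor of $p$), one can compare the gradient expression, applied to the empirical measure of a fixed sample, with central finite differences of the corresponding empirical risk.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, d = 1.7, 100000, 4
X = rng.standard_normal((n, d))
Y = X @ np.ones(d) + rng.standard_normal(n)
w = rng.standard_normal(d)

def risk(w):
    return np.mean(np.abs(X @ w - Y) ** p)   # empirical version of R(w)

r = X @ w - Y
grad_formula = p * np.mean((np.sign(r) * np.abs(r) ** (p - 1))[:, None] * X, axis=0)

eps = 1e-4                                    # central finite differences
grad_fd = np.array([(risk(w + eps * e) - risk(w - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.max(np.abs(grad_formula - grad_fd)))  # agrees up to finite-difference error
```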
We now turn to the twice differentiability of the risk. We start with the easy case . The proof is very similar to that of Lemma 5 and we omit it here.
Lemma 6.
Let and assume that and for all . Then is twice differentiable on , and
The case is more complicated. The following lemma establishes the twice differentiability of the risk at its minimizer under a subset of the assumptions of Theorem 5.
Lemma 7.
Let . Assume that and for all . Then is twice differentiable at and
Proof.
The difficulty in the proof compared to Lemma 5 and Lemma 6 stems from the fact that the loss is not twice differentiable at zero. We still rely on the dominated convergence theorem, but the construction of the dominating function is slightly more intricate. Using the setup of the proof of Lemma 5, and following the same line of arguments, we arrive at
(20) |
where we have used the fact that since , is almost surely twice differentiable at . To finish the proof, it remains to construct a dominating function for the above sequence to justify the interchange of the limit and expectation. We consider two cases.
Case 1: . Then we have
where the second line follows by the triangle inequality, the third by the definition of the operator norm, the fourth by valid for , and the fifth and sixth by the Cauchy-Schwarz inequality and the assumed lower bound on .
Case 2: . We start by noting that, for all , we have
Therefore is twice differentiable on . Now
where the second line follows from the triangle inequality, the third follows from the twice differentiability of on and the fundamental theorem of calculus applied component-wise, the fifth by Jensen’s inequality, and the last by definition of and the above lower bound. We therefore define our dominating function by
It is then immediate from our assumptions that is integrable. Interchanging the limit and the expectation in (20) and recalling that is almost surely twice differentiable at finishes the proof.
∎
Appendix B Proof of Lemma 1
We start with the claim that is upper-semicontinuous. We want to show that for any and any sequence converging to (in the norm topology)
Fix a and let be a sequence in satisfying , where for convenience we take to be the Euclidean norm on . Then we have by (reverse) Fatou’s Lemma
(21) |
Now we bound the inner limsup pointwise. We split this task in two cases. If , then
(22) |
Otherwise we have . But then, by the convergence of to , there exists a such that for all we have . This implies that for all
We conclude that
(23) |
Combining (21), (22), and (23) proves the upper-semicontinuity of . Essentially the same proof shows the lower-semicontinuity of for any ; we omit it here.
For the remaining claims, first notice that the function is scale invariant, i.e. for all and all , we have . Therefore
where is the Euclidean unit sphere. By assumption on the random vector , we know that for all . Furthermore since is upper semicontinuous, and is compact, attains its supremum on at some point . From this we conclude that
Finally, we turn to the claim about . Since , the function is a norm on , from which it follows that is also scale invariant for any . Therefore
where is the unit sphere of the norm . Now fix . We claim that for all . Suppose not. Then there exists a such that with probability , but then we get the contradiction
therefore for all . Now since is lower-semicontinuous, and is compact, attains its infimum on at some point . From this we conclude
∎
Appendix C Proof of Theorem 3
Fix , and let . Our goal will be to upper bound the probability . By assumption, we have that , so that for all . Since minimizes the empirical risk, we must also have for all . Let denote the matrix whose -th row is . Then we have the following implications.
(24) |
Let . We claim the following equivalence
(25) |
Indeed, the implication follows by definition of the row rank of . For the implication , by definition, is a spanning set for the row space of , therefore it can be reduced to a basis of it for some indices with . If , then the choice satisfies the right side of (25). Otherwise, let with . Such a subset exists since by assumption . Then the set satisfies the right side of (25). Combining (24) and (25) we arrive at:
(26) |
where the second inequality follows from the union bound. We now bound each of the terms of the sum. Fix with . Let be a non-zero vector orthogonal to . Such a vector must exist since ; see Lemma 8 below for an explicit construction of the function . Denote by the distribution of and the distribution of , where is the distribution of . Note that since is a function of only, it is independent of . In particular, the joint distribution of is given by the product . Now if , then by definition of , . Therefore
(27) |
where in the third line we used the independence of and , in the fourth we used the independence of the , in the sixth we used the definition of in (5), and in the last line we used the fact that and the definition of in Lemma 1. Combining the inequalities (26) and (27) yields the result. ∎
Lemma 8.
Let and let be a sequence of points in . Denote by the matrix whose -th row is and let be its pseudo-inverse. Let be an ordered basis of , and define
Then is non-zero and is orthogonal to .
Proof.
We start by showing that is well defined. First note that is the orthogonal projector onto the kernel of , which is non-trivial since . Now we claim that there exists an such that . Suppose not; then for any , we have , implying that , but this contradicts the non-triviality of . This proves that is well defined, which in turn proves that . It remains to prove the orthogonality claim. Let . Then there exist coefficients such that . Therefore
where the last equality holds since . ∎
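For illustration, here is a numerical transcription of this construction using numpy's pseudo-inverse (ours; the variable names are hypothetical).

```python
import numpy as np

def orthogonal_to_rows(points):
    """Given x_1, ..., x_n in R^d spanning a proper subspace, return a nonzero
    vector orthogonal to all of them, as in Lemma 8: project the standard basis
    onto ker(A) with P = I - A^+ A and return the first nonzero projection."""
    A = np.asarray(points, dtype=float)          # n x d matrix with rows x_i
    d = A.shape[1]
    P = np.eye(d) - np.linalg.pinv(A) @ A        # orthogonal projector onto ker(A)
    for e in np.eye(d):                          # ordered standard basis e_1, ..., e_d
        v = P @ e
        if np.linalg.norm(v) > 1e-10:
            return v
    raise ValueError("rows span R^d: no nonzero orthogonal vector exists")

# Toy usage: 3 points in R^4 (rank <= 3), so a nonzero orthogonal vector exists.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
v = orthogonal_to_rows(A)
print(np.max(np.abs(A @ v)))                     # ~ 0: v is orthogonal to every row
```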
Appendix D Missing proofs for Theorem 1
D.1 Proof of Lemma 2
We compute the expectation:
where in the second line we expanded the inner product of the sums into its terms, used linearity of expectation, and used the independence of the samples to take the expectation inside the inner product. In the last line, we used the fact that the samples are identically distributed to simplify the first term. For the second term, we used the fact that the expectation of the gradient of the loss at the risk minimizer vanishes. Applying Markov’s inequality finishes the proof. ∎
D.2 Proof of Corollary 1
We have
Now by assumption, the components of the vector have finite fourth moment so that applying Proposition 1 and using the condition on yields the result. ∎
Appendix E Detailed proof of Theorem 4
E.1 Proof of Lemma 3
By Lemma 2.5 in [AKPS22], we have for all
Recall that by the chain rule
Replacing and by and respectively, and using the formulas for the gradient and Hessian we arrive at
Averaging over yields the first inequality. The proof of the second inequality proceeds in the same way and uses instead the upper bound of Lemma 2.5 in [AKPS22]. We omit it here. ∎
E.2 Proof of Theorem 4
We proceed similarly to the proof of Theorem 1. By definition of the empirical risk minimizer, we have the upper bound
(28) |
Using (10) from Lemma 3 and the Cauchy-Schwarz inequality, we obtain the pointwise lower bound
(29) |
Using Lemma 2 we have that, with probability at least ,
(30) |
It remains to control from below. Define the random vector
Then, for any , we have
By assumption, the components of the random vector have finite fourth moment. Applying Proposition 1, and using the condition on assumed in the statement of Theorem 4, we get that, with probability at least , for all ,
(31) |
Combining (30) and (31) with (29) gives that with probability at least ,
(32) |
Further combining (32) with (28) and rearranging yields that with probability at least
(33) |
The last step is to bound the excess risk of the empirical risk minimizer using the bound (33) and (11) from Lemma 3. For that, we control the norm term in (11) as follows
(34) |
Appendix F Detailed proof of Theorem 5
F.1 Proof of Lemma 4
Both inequalities follow from Lemma 4.5 in [AKPS19]. (12) follows from a straightforward calculation using the lower bound of Lemma 4.5 in [AKPS19]; we omit it here. The upper bound requires a bit more work. We have by the quoted Lemma
Now assume that . If , we have
Otherwise, if , then we have
Therefore in both cases we have as long as . Replacing and by and respectively we get, on the event that
Recalling that by assumption , taking expectation of both sides, and bounding finishes the proof of (13). ∎
F.2 Proof of Theorem 5
We follow the same proof strategy as the one used in the proofs of Theorems 1 and 4. By definition of the empirical risk minimizer, we have
(35) |
Using (12) from Lemma 4 and the Cauchy-Schwarz inequality, we have the lower bound
(36) |
Using Lemma 2 we have that, with probability at least ,
(37) |
On the other hand, by Proposition 2, we have with probability , for all ,
(38) |
where is as defined in Proposition 2. We now consider two cases. If , then combining (35), (36), (37), and (38) gives
(39) |
Otherwise, , then again combining (35), (36), (37), and (38) gives
(40) |
In either case, we have, using (39) and (40), with probability at least ,
Combining this last inequality with (13) from Lemma 4 finishes the proof. ∎