
A comment and erratum on “Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE?”

Maxime Cauchois, Alnur Ali, and John Duchi
Abstract

We identify and correct an error in the paper “Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE?” This correction allows new guarantees on the excess degrees of freedom—the bias in the error estimate of Stein’s unbiased risk estimate (SURE) for an estimator tuned by directly minimizing the SURE criterion—for arbitrary SURE-tuned linear estimators. Oracle inequalities follow as a consequence of these results for such estimators.

1 Introduction and setting

Tibshirani and Rosset [4] consider the Gaussian sequence model, where

Y=\theta_{0}+Z,\qquad Z\sim\mathsf{N}(0,\sigma^{2}I_{n}) (1)

for an unknown vector $\theta_{0}\in\mathbb{R}^{n}$. Their paper develops theoretical results on the excess optimism, or the amount of downward bias in Stein's Unbiased Risk Estimate (SURE), when using SURE to select an estimator. In their analysis of subset regression estimators [4, Section 4], there is an error in their calculation of this optimism, which we identify and address in this short note. In addition, we generalize the conclusions of their results, showing how the excess optimism bounds we develop (through their inspiration) extend to essentially arbitrary linear estimators. Their major conclusions and oracle inequalities for SURE-tuned subset regression estimators thus remain true and extend beyond projection estimators.

For any estimator $\widehat{\theta}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ of $\theta_{0}$, we define the risk

\textup{Risk}\big(\widehat{\theta}\,\big)\coloneqq\mathbb{E}\left[\|\widehat{\theta}(Y)-\theta_{0}\|_{2}^{2}\right].

We study estimation via linear smoothers [2], by which we mean simply that there exists a collection of matrices $\{H_{s}\}_{s\in S}\subset\mathbb{R}^{n\times n}$, indexed by $s\in S$, where $S$ is a finite set. Each of these induces an estimator $\widehat{\theta}_{s}(Y)\coloneqq H_{s}Y$ of $\theta_{0}$, with risk

R(s)\coloneqq\textup{Risk}(\widehat{\theta}_{s})=\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+\sigma^{2}\mathop{\rm tr}(H_{s}^{T}H_{s})=\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+\sigma^{2}\left\|H_{s}\right\|_{\rm Fr}^{2}.

We write

p_{s}\coloneqq\mathop{\rm tr}(H_{s})

for the (effective) degrees of freedom [3] of $\widehat{\theta}_{s}(Y)=H_{s}Y$, i.e., $\textup{df}(\widehat{\theta}_{s})=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathop{\rm Cov}(\widehat{Y}_{i},Y_{i})$ for $\widehat{Y}_{i}=[\widehat{\theta}_{s}(Y)]_{i}=[H_{s}Y]_{i}$. The familiar SURE risk criterion is then

\textup{SURE}(s)\coloneqq\left\|Y-H_{s}Y\right\|_{2}^{2}+2\sigma^{2}p_{s},

which satisfies

\mathbb{E}[\textup{SURE}(s)]=\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+\mathbb{E}\left[\left\|(I-H_{s})Z\right\|_{2}^{2}\right]+2\sigma^{2}p_{s}=R(s)+n\sigma^{2},

so that if $Y^{\prime}$ is an independent draw from the Gaussian sequence model (1), then SURE is an unbiased estimator of the (prediction) error $\mathbb{E}[\|\widehat{\theta}(Y)-Y^{\prime}\|_{2}^{2}]$.
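For concreteness, the following minimal Python sketch (our own illustration, not code from [4]) evaluates the SURE criterion for a finite family of smoother matrices and picks the minimizer by direct enumeration. The noise variance `sigma2` is assumed known, and the two smoothers in the demo are hypothetical stand-ins for $\{H_{s}\}_{s\in S}$.

```python
import numpy as np

def sure(Y, H, sigma2):
    # SURE(s) = ||Y - H_s Y||_2^2 + 2 sigma^2 tr(H_s)
    resid = Y - H @ Y
    return resid @ resid + 2.0 * sigma2 * np.trace(H)

def sure_select(Y, smoothers, sigma2):
    # Return the index s minimizing SURE(s) over the finite family.
    scores = {s: sure(Y, H, sigma2) for s, H in smoothers.items()}
    return min(scores, key=scores.get)

# Tiny demo with two hypothetical smoothers on n = 4 observations.
Y = np.array([1.0, 2.0, 0.5, 1.5])
smoothers = {"identity": np.eye(4), "mean": np.full((4, 4), 0.25)}
print(sure_select(Y, smoothers, sigma2=1.0))  # prints "mean" here
```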

Let $s_{0}\in S$ be the oracle choice of estimator,

s_{0}\coloneqq\mathop{\rm argmin}_{s\in S}\left\{\textup{Risk}(\widehat{\theta}_{s})\right\},

and let $\widehat{s}(Y)$ be the index minimizing the SURE criterion,

\widehat{s}(Y)\coloneqq\mathop{\rm argmin}_{s\in S}\left\{\textup{SURE}(s)=\left\|Y-H_{s}Y\right\|_{2}^{2}+2\sigma^{2}p_{s}\right\}.

Tibshirani and Rosset [4] study the excess risk that using $\widehat{s}(Y)$ in place of $s_{0}$ induces in the estimate $\widehat{\theta}_{\widehat{s}}$ of $\theta_{0}$, defining the excess optimism

\textup{ExOpt}(\widehat{\theta}_{\widehat{s}})\coloneqq\textup{Risk}(\widehat{\theta}_{\widehat{s}})+n\sigma^{2}-\mathbb{E}[\textup{SURE}(\widehat{s}(Y))]=\textup{Risk}(\widehat{\theta}_{\widehat{s}})+n\sigma^{2}-\mathbb{E}\left[\min_{s\in S}\textup{SURE}(s)\right],

which they conjecture is always nonnegative. Rearranging this definition and using that $\mathbb{E}[\min_{s\in S}\textup{SURE}(s)]\leq\min_{s\in S}\mathbb{E}[\textup{SURE}(s)]$ yields the risk upper bound (cf. [4, Thm. 1])

\textup{Risk}(\widehat{\theta}_{\widehat{s}})\leq\textup{Risk}(\widehat{\theta}_{s_{0}})+\textup{ExOpt}(\widehat{\theta}_{\widehat{s}}), (2)

so bounding $\textup{ExOpt}$ suffices to control $\textup{Risk}(\widehat{\theta}_{\widehat{s}})$. By a standard calculation [4, Sec. 1],

\textup{ExOpt}(\widehat{\theta}_{\widehat{s}})=\mathbb{E}\left[\|\theta_{0}-H_{\widehat{s}(Y)}Y\|_{2}^{2}\right]+n\sigma^{2}-\mathbb{E}\left[\|\theta_{0}+Z-H_{\widehat{s}(Y)}Y\|_{2}^{2}+2\sigma^{2}p_{\widehat{s}(Y)}\right]=2\,\mathbb{E}\left[(H_{\widehat{s}(Y)}Y-\theta_{0})^{T}Z-\sigma^{2}p_{\widehat{s}(Y)}\right].

Note that $Z=Y-\theta_{0}$ and $\mathbb{E}[Z]=0$, and define the estimated $Y_{i}$ by $\widehat{Y}_{i}=[H_{\widehat{s}(Y)}Y]_{i}$. Then following [4], we see that if we define the excess degrees of freedom

\textup{edf}(\widehat{\theta}_{\widehat{s}})\coloneqq\mathbb{E}\left[\frac{1}{\sigma^{2}}(H_{\widehat{s}(Y)}Y)^{T}(Y-\theta_{0})-p_{\widehat{s}(Y)}\right]=\bigg(\sum_{i=1}^{n}\frac{1}{\sigma^{2}}\mathop{\rm Cov}(\widehat{Y}_{i},Y_{i})\bigg)-\mathbb{E}[p_{\widehat{s}(Y)}], (3)

the gap between the effective degrees of freedom and the parameter “count,” then

\textup{ExOpt}(\widehat{\theta}_{\widehat{s}})=2\sigma^{2}\,\textup{edf}(\widehat{\theta}_{\widehat{s}}).

Tibshirani and Rosset [4] study bounds on this excess degrees-of-freedom quantity.
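To make the definition (3) concrete, the Monte Carlo sketch below (our own illustration, using a hypothetical family of coordinate projections) estimates $\textup{edf}(\widehat{\theta}_{\widehat{s}})$ by averaging over repeated draws of $Y=\theta_{0}+Z$, using that $\mathop{\rm Cov}(\widehat{Y}_{i},Y_{i})=\mathbb{E}[\widehat{Y}_{i}Z_{i}]$ since $\mathbb{E}[Z]=0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 30, 1.0, 2000
theta0 = np.linspace(0.0, 2.0, n)
# Hypothetical family: projections onto the first j coordinates, j = 1, ..., n.
smoothers = [np.diag((np.arange(n) < j).astype(float)) for j in range(1, n + 1)]
traces = np.array([np.trace(H) for H in smoothers])

cross_sum, p_sum = np.zeros(n), 0.0
for _ in range(reps):
    Z = rng.normal(scale=np.sqrt(sigma2), size=n)
    Y = theta0 + Z
    scores = [np.sum((Y - H @ Y) ** 2) + 2.0 * sigma2 * t
              for H, t in zip(smoothers, traces)]
    s_hat = int(np.argmin(scores))
    cross_sum += (smoothers[s_hat] @ Y) * Z  # accumulates Yhat_i * Z_i
    p_sum += traces[s_hat]                   # accumulates p_{s-hat}

# Monte Carlo version of (3): sum_i Cov(Yhat_i, Y_i)/sigma^2 - E[p_{s-hat}].
edf_hat = cross_sum.sum() / (sigma2 * reps) - p_sum / reps
print(f"estimated excess degrees of freedom: {edf_hat:.2f}")
```

One typically observes a positive estimate in such simulations, consistent with the conjectured nonnegativity of the excess optimism.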

The excess degrees of freedom and an error

By adding and subtracting $H_{\widehat{s}(Y)}\theta_{0}$ in the definition (3) of the excess degrees of freedom, we obtain the equality

\textup{edf}(\widehat{\theta}_{\widehat{s}})=\mathbb{E}\left[\frac{1}{\sigma^{2}}Z^{T}H_{\widehat{s}(Y)}Z-p_{\widehat{s}(Y)}\right]+\frac{1}{\sigma^{2}}\mathbb{E}\left[(H_{\widehat{s}(Y)}\theta_{0})^{T}Z\right]. (4)

This is a corrected version of [4, Equation (41)], as the referenced equation omits the second term. In this note, we provide bounds on this corrected edf quantity. The rough challenge is that, with a naive analysis, the second term in the expansion (4) may scale as $\frac{1}{\sigma}\left\|\theta_{0}\right\|_{2}\sqrt{\log|S|}$; we can prove that it is smaller, and the purpose of Theorem 1 in the next section is to provide a corrected bound on $\textup{edf}(\widehat{\theta}_{\widehat{s}})$.

Notation

We collect our mostly standard notation here. For functions $f,g:\mathcal{A}\to\mathbb{R}$, where $\mathcal{A}$ is any abstract index set, we write $f(a)\lesssim g(a)$ to indicate that there is a numerical constant $C<\infty$, independent of $a\in\mathcal{A}$, such that $f(a)\leq Cg(a)$ for all $a\in\mathcal{A}$. We let $\log_{+}t=\max\{0,\log t\}$ for $t>0$. The operator norm of a matrix $A$ is $|\!|\!|A|\!|\!|_{\rm op}=\sup_{\|u\|_{2}=1}\|Au\|_{2}$.

2 A bound on the excess degrees of freedom

We present a bound on the excess degrees of freedom $\textup{edf}(\widehat{\theta}_{\widehat{s}})$ of the SURE-tuned estimator $\widehat{\theta}_{\widehat{s}(Y)}=H_{\widehat{s}(Y)}Y$, where $\widehat{s}(Y)=\mathop{\rm argmin}_{s\in S}\textup{SURE}(s)$. In the theorem, we recall the oracle choice $s_{0}=\mathop{\rm argmin}_{s\in S}\textup{Risk}(\widehat{\theta}_{s})$.

Theorem 1.

Let $r_{\star}$ satisfy $\frac{1}{\sigma^{2}}R(s_{0})=\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\leq r_{\star}$. Assume additionally that $|\!|\!|H_{s}|\!|\!|_{\rm op}\leq h_{\textup{op}}$ for all $s\in S$, where $h_{\textup{op}}\geq 1$. Then

\textup{edf}(\widehat{\theta}_{\widehat{s}})\lesssim\sqrt{r_{\star}\log|S|}+h_{\textup{op}}\log|S|\cdot\left(1+\log_{+}\frac{h_{\textup{op}}^{2}\log|S|}{r_{\star}}\right).

We prove the theorem in Section 2.3. Here we provide a bit of commentary, first discussing subset regression estimators, as in the paper [4, Sec. 4], and then more general smoothers [2]. As we shall see, the assumption that the matrices $H_{s}$ have bounded operator norm loses little generality [2, Sec. 2.5].

2.1 Subset regression and projection estimators

We first revisit the setting that Tibshirani and Rosset [4, Sec. 4] consider, in which each matrix $H_{s}$ is an orthogonal projection. As a motivating example, in linear regression we may take $H_{s}=X_{s}(X_{s}^{T}X_{s})^{-1}X_{s}^{T}$, where $X_{s}$ denotes the submatrix of the design $X\in\mathbb{R}^{n\times p}$ whose columns $s$ indexes. In this case, as $H_{s}^{2}=H_{s}$, we have $|\!|\!|H_{s}|\!|\!|_{\rm op}\leq 1$ and $\mathop{\rm tr}(H_{s})=\left\|H_{s}\right\|_{\rm Fr}^{2}=p_{s}$. First consider the case that there is $s^{\star}$ satisfying $H_{s^{\star}}\theta_{0}=\theta_{0}$, that is, the true model belongs to the set $S$. Then because $s_{0}$ minimizes $\frac{1}{\sigma^{2}}\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+p_{s}$, we may take $r_{\star}=p_{s^{\star}}$, so that

\textup{edf}(\widehat{\theta}_{\widehat{s}})\lesssim\sqrt{p_{s^{\star}}\log|S|}+\log|S|\cdot\left(1+\log_{+}\frac{\log|S|}{p_{s^{\star}}}\right).

Whenever $p_{s^{\star}}\gtrsim\log|S|$, we have the simplified bound

\textup{edf}(\widehat{\theta}_{\widehat{s}})\lesssim\sqrt{p_{s^{\star}}\log|S|}.
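As a concrete illustration of this subset regression setting, the sketch below (our own, with a small hypothetical design) enumerates the projections $H_{s}=X_{s}(X_{s}^{T}X_{s})^{-1}X_{s}^{T}$ over all nonempty column subsets and tunes by SURE; here $p_{s}=\mathop{\rm tr}(H_{s})=|s|$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 40, 5, 1.0
X = rng.normal(size=(n, p))                 # hypothetical design
theta0 = X[:, :2] @ np.array([1.0, -2.0])   # true mean uses columns {0, 1}
Y = theta0 + rng.normal(scale=np.sqrt(sigma2), size=n)

best, best_score = None, np.inf
for r in range(1, p + 1):
    for s in itertools.combinations(range(p), r):
        Xs = X[:, list(s)]
        # Orthogonal projection onto col(Xs); p_s = tr(H_s) = |s| here.
        H = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)
        score = np.sum((Y - H @ Y) ** 2) + 2.0 * sigma2 * len(s)
        if score < best_score:
            best, best_score = s, score
print("SURE-selected subset:", best)
```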

An alternative way to look at the result is to replace $r_{\star}$ with $\frac{1}{\sigma^{2}}R(s_{0})=\frac{1}{\sigma^{2}}\min_{s\in S}R(s)$. In this case, combining [4, Thm. 1] with Theorem 1, we obtain

\textup{Risk}(\widehat{\theta}_{\widehat{s}})\leq R(s_{0})+C\left(\sqrt{R(s_{0})\,\sigma^{2}\log|S|}+\sigma^{2}\log|S|\cdot\left(1+\log_{+}\frac{\sigma^{2}\log|S|}{R(s_{0})}\right)\right)

for a (numerical) constant $C$. In particular, if $\min_{s\in S}\textup{Risk}(\widehat{\theta}_{s})\gtrsim\sigma^{2}\log|S|$, then we obtain that there exists a numerical constant $C$ such that for all $\eta>0$,

\textup{Risk}(\widehat{\theta}_{\widehat{s}})\leq(1+C\eta)\min_{s\in S}\textup{Risk}(\widehat{\theta}_{s})+C\frac{\sigma^{2}\log|S|}{\eta}. (5)

Additional perspective comes by considering some of the bounds Tibshirani and Rosset [4, Sec. 4] develop. While we cannot recover identical results because of the correction term in the expansion (4) of the excess degrees of freedom, we can recover analogues. As one example, consider their Theorem 2. It assumes that $n\to\infty$ in the sequence model (1), and (implicitly letting $S$ and $s_{0}$ depend on $n$) that $\textup{Risk}(\widehat{\theta}_{s_{0}})/\log|S|\to\infty$. Then inequality (5) immediately gives the oracle inequality $\textup{Risk}(\widehat{\theta}_{\widehat{s}})=(1+o(1))\textup{Risk}(\widehat{\theta}_{s_{0}})$, and Theorem 1 moreover shows that

\frac{\textup{edf}(\widehat{\theta}_{\widehat{s}})}{\frac{1}{\sigma^{2}}\textup{Risk}(\widehat{\theta}_{s_{0}})}\lesssim\sqrt{\frac{\sigma^{2}\log|S|}{\textup{Risk}(\widehat{\theta}_{s_{0}})}}\to 0.

In brief, Tibshirani and Rosset’s major conclusions on subset regression estimators—and projection estimators more generally—hold, but with a few tweaks.

2.2 Examples of more general smoothers

In more generality, the matrices $H_{s}\in\mathbb{R}^{n\times n}$ defining the estimators may be arbitrary. In common cases, however, they are reasonably well-behaved. Take as motivation nonparametric regression, where $Y_{i}=f(X_{i})+\varepsilon_{i}$, and we let $\theta_{i}=f(X_{i})$ for $i=1,\ldots,n$ and $X_{i}\in\mathcal{X}$, so that $\textup{Risk}(\widehat{\theta})$ measures the in-sample risk of an estimator $\widehat{\theta}$ of $[f(X_{i})]_{i=1}^{n}$. Here, standard estimators include kernel regression and local averaging [2, 6], both of which we touch on.

First consider kernel ridge regression (KRR). For a reproducing kernel $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, the (positive semidefinite) Gram matrix $G$ has entries $G_{ij}=K(X_{i},X_{j})$. Then for $\lambda\in\mathbb{R}_{+}$, we define $H_{\lambda}=(G+\lambda I)^{-1}G$, which is symmetric and positive semidefinite, and satisfies $|\!|\!|H_{\lambda}|\!|\!|_{\rm op}\leq 1$. Assume the collection $\Lambda\subset\mathbb{R}_{+}$ is finite and define the effective dimension $p_{\lambda}=\mathop{\rm tr}(G(G+\lambda I)^{-1})$, yielding $\textup{SURE}(\lambda)=\|Y-H_{\lambda}Y\|_{2}^{2}+2\sigma^{2}p_{\lambda}$. Then SURE-tuned KRR, with regularization $\widehat{\lambda}=\mathop{\rm argmin}_{\lambda\in\Lambda}\textup{SURE}(\lambda)$ and $\widehat{\theta}_{\widehat{\lambda}}=H_{\widehat{\lambda}}Y$, satisfies

\textup{edf}(\widehat{\theta}_{\widehat{\lambda}})\lesssim\sqrt{r_{\star}\log|\Lambda|}+\log|\Lambda|\cdot\left(1+\log_{+}\frac{\log|\Lambda|}{r_{\star}}\right),

where as usual, $r_{\star}=\min_{\lambda\in\Lambda}\{\frac{1}{\sigma^{2}}\left\|(I-H_{\lambda})\theta_{0}\right\|_{2}^{2}+\mathop{\rm tr}(G(G+\lambda I)^{-1})\}$.
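The following sketch (our own illustration; the Gaussian kernel and its bandwidth are arbitrary choices, not prescribed by the theory) carries out SURE-tuned KRR over a finite grid $\Lambda$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 60, 0.25
X = np.sort(rng.uniform(-1.0, 1.0, size=n))
theta0 = np.sin(np.pi * X)                  # hypothetical smooth truth
Y = theta0 + rng.normal(scale=np.sqrt(sigma2), size=n)

# Gaussian kernel Gram matrix with (arbitrary) bandwidth 0.2.
G = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * 0.2**2))
lambdas = np.logspace(-3, 2, 30)            # finite grid Lambda

def sure_krr(lam):
    H = np.linalg.solve(G + lam * np.eye(n), G)  # H_lambda = (G + lam I)^{-1} G
    p_lam = np.trace(H)                          # effective dimension p_lambda
    return np.sum((Y - H @ Y) ** 2) + 2.0 * sigma2 * p_lam

lam_hat = min(lambdas, key=sure_krr)
print(f"SURE-selected lambda: {lam_hat:.4g}")
```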

A second example arises from $k$-nearest-neighbor ($k$-nn) estimators. We take $S=\{1,\ldots,n\}$ to index the number of nearest neighbors to average, and for $k\in S$ let $\mathcal{N}_{k}(i)$ denote the indices of the $k$ nearest neighbors $X_{j}$ to $X_{i}$ in $\{X_{1},\ldots,X_{n}\}$, so $\mathcal{N}_{k}^{-1}(i)=\{j\mid i\in\mathcal{N}_{k}(j)\}$ are the indices $j$ for which $X_{i}$ is a neighbor of $X_{j}$. Then the matrix $H_{k}\in\mathbb{R}^{n\times n}_{+}$ satisfies $[H_{k}]_{ij}=\frac{1}{k}1\{j\in\mathcal{N}_{k}(i)\}$, and we claim that $|\!|\!|H_{k}|\!|\!|_{\rm op}^{2}\leq\frac{1}{k}\max_{i}|\mathcal{N}_{k}^{-1}(i)|$. Indeed,

[H_{k}^{T}H_{k}]_{ij}=\frac{1}{k^{2}}\sum_{l=1}^{n}1\{i\in\mathcal{N}_{k}(l),\,j\in\mathcal{N}_{k}(l)\},

and as $H_{k}^{T}H_{k}$ is elementwise nonnegative, the Gershgorin circle theorem guarantees

|\!|\!|H_{k}|\!|\!|_{\rm op}^{2}\leq\max_{i\leq n}\sum_{j=1}^{n}[H_{k}^{T}H_{k}]_{ij}\leq\frac{1}{k^{2}}\max_{i\leq n}\sum_{l=1}^{n}\sum_{j=1}^{n}1\{i\in\mathcal{N}_{k}(l),\,j\in\mathcal{N}_{k}(l)\}\leq\frac{1}{k}\max_{i\leq n}\sum_{l=1}^{n}1\{i\in\mathcal{N}_{k}(l)\}=\frac{1}{k}\max_{i\leq n}|\mathcal{N}_{k}^{-1}(i)|

as $|\mathcal{N}_{k}(l)|\leq k$ for each $l$. Additionally, we have $\left\|H_{k}\right\|_{\rm Fr}^{2}=\frac{nk}{k^{2}}=\frac{n}{k}$. The normalized risk of the $k$-nn estimator is then $r_{k}=\frac{1}{\sigma^{2}}\left\|(I-H_{k})\theta_{0}\right\|_{2}^{2}+\frac{n}{k}$, and certainly $r_{k}\geq\log n$ whenever $k\leq\frac{n}{\log n}$. Under a few restrictions, we can therefore obtain an oracle-type inequality: assume the points $\{X_{1},\ldots,X_{n}\}$ are regular enough that $\max_{i}|\mathcal{N}_{k}^{-1}(i)|\lesssim k$ for $k\leq\frac{n}{\log n}$. Then the SURE-tuned $k$-nearest-neighbor estimator $\widehat{\theta}_{\widehat{k}}$ satisfies the bound

\textup{edf}(\widehat{\theta}_{\widehat{k}})\lesssim\sqrt{\frac{1}{\sigma^{2}}\min_{k\leq n/\log n}\textup{Risk}(\widehat{\theta}_{k})\cdot\log n}

on its excess degrees of freedom. Via inequality (2), this implies the oracle inequality

\textup{Risk}(\widehat{\theta}_{\widehat{k}})\leq\min_{k\leq n/\log n}\left(1+C\eta\right)\textup{Risk}(\widehat{\theta}_{k})+\frac{C\sigma^{2}\log n}{\eta}\qquad\mbox{for all }\eta>0.
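To see the operator-norm claim numerically, the sketch below (our own check, on hypothetical uniformly distributed points; we adopt the convention that a point is its own first neighbor) builds $H_{k}$ and compares $|\!|\!|H_{k}|\!|\!|_{\rm op}^{2}$ with the Gershgorin bound $\frac{1}{k}\max_{i}|\mathcal{N}_{k}^{-1}(i)|$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.uniform(size=(n, 2))                # hypothetical design points
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def H_knn(k):
    # k-nn smoother: [H_k]_{ij} = (1/k) 1{j in N_k(i)}; rows sum to one.
    H = np.zeros((n, n))
    nbrs = np.argsort(D, axis=1)[:, :k]     # self has distance 0, so it is included
    for i in range(n):
        H[i, nbrs[i]] = 1.0 / k
    return H

for k in (5, 20, 50):
    H = H_knn(k)
    op2 = np.linalg.norm(H, 2) ** 2         # squared operator (spectral) norm
    in_deg = (H > 0).sum(axis=0).max()      # max_i |N_k^{-1}(i)|
    print(f"k={k:3d}  op_norm^2={op2:.3f}  bound={in_deg / k:.3f}")
```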

2.3 Proof of Theorem 1

Our proof strategy is familiar from high-dimensional statistics [cf. 6, Chs. 7 & 9]: we develop a basic inequality relating the risk of $\widehat{s}$ to that of $s_{0}$, then apply a peeling argument [5] to bound the probability that the error terms in this inequality deviate far from their expectations. Throughout, we let $C$ denote a universal (numerical) constant whose value may change from line to line.

To prove the theorem, we first recall a definition and state two auxiliary lemmas.

Definition 2.1.

A mean-zero random variable $X$ is $(\tau^{2},b)$-sub-exponential if $\mathbb{E}[e^{\lambda X}]\leq\exp(\frac{\lambda^{2}\tau^{2}}{2})$ for $|\lambda|\leq\frac{1}{b}$ (with the convention $\frac{1}{0}=\infty$). If $b=0$, then $X$ is $\tau^{2}$-sub-Gaussian.

Lemma 2.1.

Let $X_{i}$, $i=1,\ldots,N$, be $\tau^{2}$-sub-Gaussian and $k\geq 1$. Then

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]\leq 2\tau^{k}\max\left\{(2\log N)^{k/2},\,k^{k/2}\right\}.

Let $X_{i}$, $i=1,\ldots,N$, be $(\tau^{2},b)$-sub-exponential and $k\geq 1$. Then

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]^{1/k}\lesssim\max\left\{\sqrt{\tau^{2}\log N},\,b\log N,\,\sqrt{\tau^{2}k},\,bk\right\}.

The second statement of the lemma generalizes the first without specifying constants. We also use that quadratic forms of Gaussian random vectors are sub-exponential.

Lemma 2.2.

Let $Z\sim\mathsf{N}(0,I_{n\times n})$ and $A\in\mathbb{R}^{n\times n}$. Then $Z^{T}AZ-\mathop{\rm tr}(A)$ is $(\mathop{\rm tr}((A+A^{T})^{2}),\,2|\!|\!|A+A^{T}|\!|\!|_{\rm op})$-sub-exponential. In particular, $Z^{T}AZ-\mathop{\rm tr}(A)$ is $(4\left\|A\right\|_{\rm Fr}^{2},\,4|\!|\!|A|\!|\!|_{\rm op})$-sub-exponential.

We defer the proofs of Lemmas 2.1 and 2.2 to Sections 2.4 and 2.5, respectively.
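As a quick numerical sanity check on Lemma 2.2 (our own sketch, not part of the proof), one can compare an empirical moment generating function of a centered Gaussian quadratic form against the claimed sub-exponential bound:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
A = rng.normal(size=(n, n))
tau2 = np.trace((A + A.T) @ (A + A.T))   # tau^2 = tr((A + A^T)^2)
b = 2.0 * np.linalg.norm(A + A.T, 2)     # b = 2 |||A + A^T|||_op

Z = rng.normal(size=(200_000, n))
Q = np.einsum("ij,jk,ik->i", Z, A, Z) - np.trace(A)  # centered Z^T A Z
for lam in (0.25 / b, 0.5 / b, 1.0 / b):
    mgf = np.exp(lam * Q).mean()         # empirical E[exp(lambda * Q)]
    print(f"lambda={lam:.4f}: MGF={mgf:.4f}, bound={np.exp(lam**2 * tau2 / 2):.4f}")
```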

Recall the notation $R(s)=\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+\sigma^{2}\left\|H_{s}\right\|_{\rm Fr}^{2}$, and for $s\in S$ define the centered variables

W_{s}\coloneqq\frac{1}{\sigma^{2}}Z^{T}(2H_{s}-H_{s}^{T}H_{s})Z+\mathop{\rm tr}(H_{s}^{T}H_{s})-2p_{s},\qquad Z_{s}\coloneqq\frac{1}{\sigma^{2}}\theta_{0}^{T}(H_{s}-I)^{T}(I-H_{s})Z. (6)

A quick calculation shows that for every sSs\in S,

\frac{1}{\sigma^{2}}\textup{SURE}(s)=\frac{1}{\sigma^{2}}R(s)+\frac{1}{\sigma^{2}}\left\|Z\right\|_{2}^{2}-W_{s}-2Z_{s}.

As $\widehat{s}(Y)$ minimizes the SURE criterion, we therefore have the basic inequality

\frac{1}{\sigma^{2}}\left(R(\widehat{s}(Y))-R(s_{0})\right)\leq W_{\widehat{s}(Y)}-W_{s_{0}}+2(Z_{\widehat{s}(Y)}-Z_{s_{0}}). (7)

We now provide a peeling argument using inequality (7). For each $l\in\mathbb{N}$ define the shell

\mathcal{S}_{l}\coloneqq\{s\in S\mid(2^{l}-1)\sigma^{2}r_{\star}\leq R(s)-R(s_{0})\leq(2^{l+1}-1)\sigma^{2}r_{\star}\}.

The key result is the following lemma.

Lemma 2.3.

There exists a numerical constant $c>0$ such that

\mathbb{P}(\widehat{s}(Y)\in\mathcal{S}_{l})\leq 2|\mathcal{S}_{l}|\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right).

Proof    Note that if $\widehat{s}(Y)\in\mathcal{S}_{l}$, we have

\max_{s\in\mathcal{S}_{l}}\left(W_{s}-W_{s_{0}}\right)+2\max_{s\in\mathcal{S}_{l}}(Z_{s}-Z_{s_{0}})\geq\max_{s\in\mathcal{S}_{l}}\left(W_{s}-W_{s_{0}}+2(Z_{s}-Z_{s_{0}})\right)\geq r_{\star}(2^{l}-1),

and so it must be the case that at least one of

\max_{s\in\mathcal{S}_{l}}(W_{s}-W_{s_{0}})\geq\frac{1}{2}r_{\star}(2^{l}-1)\qquad\mbox{or}\qquad\max_{s\in\mathcal{S}_{l}}(Z_{s}-Z_{s_{0}})\geq\frac{1}{4}r_{\star}(2^{l}-1) (8)

occurs. We can thus bound the probability that $\widehat{s}(Y)\in\mathcal{S}_{l}$ by bounding the probabilities of each of the events (8); to do this, we show that $W_{s}-W_{s_{0}}$ and $Z_{s}-Z_{s_{0}}$ concentrate for $s\in\mathcal{S}_{l}$.

As promised, we now show that $W_{s}-W_{s_{0}}$ and $Z_{s}-Z_{s_{0}}$ are sub-exponential and sub-Gaussian, respectively (recall Definition 2.1). Observe that

\frac{1}{\sigma^{2}}(R(s)-R(s_{0}))=\frac{1}{\sigma^{2}}\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}-\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s}\right\|_{\rm Fr}^{2}-\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\geq\left\|H_{s}\right\|_{\rm Fr}^{2}-r_{\star} (9)

by assumption that $r_{\star}\geq\frac{1}{\sigma^{2}}R(s_{0})=\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}$ and that $\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}\geq 0$. In particular, inequality (9) shows that for each $s\in\mathcal{S}_{l}$ we have $\left\|H_{s}\right\|_{\rm Fr}^{2}-r_{\star}\leq(2^{l+1}-1)r_{\star}$ (and we always have $\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\leq r_{\star}$), so that

\left\|H_{s}\right\|_{\rm Fr}^{2}\leq 2^{l+1}r_{\star}\quad\mbox{and}\quad\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\leq r_{\star}\qquad\mbox{for }s\in\mathcal{S}_{l}. (10)

For each ss we have

W_{s}-W_{s_{0}}=\frac{1}{\sigma^{2}}Z^{T}(2H_{s}-2H_{s_{0}}-H_{s}^{T}H_{s}+H_{s_{0}}^{T}H_{s_{0}})Z+\left\|H_{s}\right\|_{\rm Fr}^{2}-\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}-2(p_{s}-p_{s_{0}}),

while

\left\|2H_{s}-2H_{s_{0}}-H_{s}^{T}H_{s}+H_{s_{0}}^{T}H_{s_{0}}\right\|_{\rm Fr}^{2}\leq 4\left(4\left\|H_{s}\right\|_{\rm Fr}^{2}+4\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}+h_{\textup{op}}^{2}\left\|H_{s}\right\|_{\rm Fr}^{2}+h_{\textup{op}}^{2}\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\right)

by assumption that $|\!|\!|H_{s}|\!|\!|_{\rm op}\leq h_{\textup{op}}$ for all $s\in S$. Lemma 2.2 and the bounds (10) on $\left\|H_{s}\right\|_{\rm Fr}$ thus give that $W_{s}-W_{s_{0}}$ is $(C\cdot 2^{l}h_{\textup{op}}^{2}r_{\star},\,C\cdot h_{\textup{op}}^{2})$-sub-exponential, so that for $t\geq 0$, a Chernoff bound implies

\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(W_{s}-W_{s_{0}})\geq t\right)\leq|\mathcal{S}_{l}|\exp\left(C\cdot 2^{l}\lambda^{2}h_{\textup{op}}^{2}r_{\star}-\lambda t\right),

valid for $0\leq\lambda\leq\frac{1}{Ch_{\textup{op}}^{2}}$. Taking $t=\frac{1}{2}(2^{l}-1)r_{\star}$ and $\lambda=\frac{1}{C^{\prime}h_{\textup{op}}^{2}}$ yields that for a numerical constant $c>0$,

\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(W_{s}-W_{s_{0}})\geq\frac{1}{2}(2^{l}-1)r_{\star}\right)\leq|\mathcal{S}_{l}|\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right).

We can provide a similar bound on $Z_{s}-Z_{s_{0}}$ for $s\in\mathcal{S}_{l}$. It is immediate that $Z_{s}\sim\mathsf{N}(0,\frac{1}{\sigma^{2}}\left\|(H_{s}-I)^{T}(I-H_{s})\theta_{0}\right\|_{2}^{2})$. Using inequality (9) and that $\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\leq r_{\star}$, for each $s\in\mathcal{S}_{l}$ we have

\frac{1}{\sigma^{2}}\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}=\frac{1}{\sigma^{2}}(R(s)-R(s_{0}))+\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}-\left\|H_{s}\right\|_{\rm Fr}^{2}\leq r_{\star}(2^{l+1}-1)+r_{\star}-\left\|H_{s}\right\|_{\rm Fr}^{2}\leq 2^{l+1}r_{\star}.

Similarly, $Z_{s_{0}}\sim\mathsf{N}(0,\frac{1}{\sigma^{2}}\left\|(H_{s_{0}}-I)^{T}(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2})$ and $\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}\leq r_{\star}$. Using that $|\!|\!|I-H_{s}|\!|\!|_{\rm op}\leq 1+h_{\textup{op}}$, for each $s\in\mathcal{S}_{l}$ we have $Z_{s}-Z_{s_{0}}\sim\mathsf{N}(0,\tau^{2}(s))$ for some $\tau^{2}(s)\leq C\cdot h_{\textup{op}}^{2}2^{l}r_{\star}$. This yields the bound

\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(Z_{s}-Z_{s_{0}})\geq\frac{1}{4}r_{\star}(2^{l}-1)\right)\leq|\mathcal{S}_{l}|\exp\left(-c\,\frac{r_{\star}(2^{l}-1)^{2}}{2^{l}h_{\textup{op}}^{2}}\right)\leq|\mathcal{S}_{l}|\exp\left(-c^{\prime}\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right),

where $c,c^{\prime}>0$ are numerical constants. Returning to the events (8), we have shown

\mathbb{P}(\widehat{s}(Y)\in\mathcal{S}_{l})\leq\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(W_{s}-W_{s_{0}})\geq\frac{1}{2}(2^{l}-1)r_{\star}\right)+\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(Z_{s}-Z_{s_{0}})\geq\frac{1}{4}r_{\star}(2^{l}-1)\right)\leq 2|\mathcal{S}_{l}|\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right)

as desired. ∎

We leverage the probability bound in Lemma 2.3 to give our final guarantees. Define the (centered) linear and quadratic terms

Q_{s}\coloneqq\frac{1}{\sigma^{2}}Z^{T}H_{s}Z-p_{s}\qquad\mbox{and}\qquad L_{s}\coloneqq\frac{1}{\sigma^{2}}Z^{T}H_{s}\theta_{0},

so that $\textup{edf}(\widehat{\theta}_{\widehat{s}})=\mathbb{E}[Q_{\widehat{s}(Y)}]+\mathbb{E}[L_{\widehat{s}(Y)}]$. Expanding this equality over the shells $\mathcal{S}_{l}$, we have

\textup{edf}(\widehat{\theta}_{\widehat{s}})=\sum_{l=0}^{\infty}\left(\mathbb{E}[Q_{\widehat{s}(Y)}1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}]+\mathbb{E}[L_{\widehat{s}(Y)}1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}]\right).

As in the proof of Lemma 2.3, the bounds (10) that $\left\|H_{s}\right\|_{\rm Fr}^{2}\leq 2^{l+1}r_{\star}$ and Lemma 2.2 guarantee that $Q_{s}$ is $(2^{l+3}r_{\star},4h_{\textup{op}})$-sub-exponential for $s\in\mathcal{S}_{l}$. Thus we have

\mathbb{E}\left[Q_{\widehat{s}(Y)}1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}\right]\leq\mathbb{E}\left[\max_{s\in\mathcal{S}_{l}}Q_{s}\,1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}\right]\stackrel{(i)}{\leq}\mathbb{E}\left[\max_{s\in\mathcal{S}_{l}}Q_{s}^{2}\right]^{1/2}\mathbb{P}(\widehat{s}(Y)\in\mathcal{S}_{l})^{1/2}\stackrel{(ii)}{\lesssim}\max\left\{\sqrt{2^{l}r_{\star}\log|\mathcal{S}_{l}|},\,h_{\textup{op}}\log|\mathcal{S}_{l}|\right\}\min\left\{1,\sqrt{|\mathcal{S}_{l}|}\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right)\right\},

where inequality $(i)$ is Cauchy-Schwarz and inequality $(ii)$ follows by combining Lemma 2.1 (take $k=2$) and Lemma 2.3. We similarly have that $L_{s}$ is $C\cdot 2^{l}r_{\star}$-sub-Gaussian, yielding

\mathbb{E}[L_{\widehat{s}(Y)}1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}]\lesssim\sqrt{2^{l}r_{\star}\log|\mathcal{S}_{l}|}\,\min\left\{1,\sqrt{|\mathcal{S}_{l}|}\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right)\right\}.

Temporarily introduce the shorthand $r=\frac{r_{\star}}{h_{\textup{op}}^{2}}$. Substituting these bounds into the expansion of $\textup{edf}(\widehat{\theta}_{\widehat{s}})$ above and naively bounding $|\mathcal{S}_{l}|\leq|S|$ yields

\textup{edf}(\widehat{\theta}_{\widehat{s}})\lesssim\sqrt{r_{\star}\log|S|}\int_{0}^{\infty}\sqrt{2^{t}\min\{1,|S|\exp(-c2^{t}r)\}}\,dt+h_{\textup{op}}\log|S|\int_{0}^{\infty}\sqrt{\min\{1,|S|\exp(-c2^{t}r)\}}\,dt+\sqrt{r_{\star}\log|S|}+h_{\textup{op}}\log|S|

\lesssim h_{\textup{op}}\sqrt{\log|S|}\int_{cr/2}^{\infty}u^{-\frac{1}{2}}\min\{1,\sqrt{|S|}e^{-u}\}\,du+h_{\textup{op}}\log|S|\int_{cr/2}^{\infty}\frac{1}{u}\min\{1,\sqrt{|S|}e^{-u}\}\,du+\sqrt{r_{\star}\log|S|}+h_{\textup{op}}\log|S|,

where we made the substitution $u=c2^{t-1}r$ (noting that $\sqrt{r_{\star}/r}=h_{\textup{op}}$). We break each of the integrals into the regions $\frac{1}{2}cr\leq u\leq\frac{1}{2}\max\{cr,\log|S|\}$ and $u\geq\frac{1}{2}\max\{cr,\log|S|\}$. Thus

\sqrt{\log|S|}\int_{cr/2}^{\infty}u^{-\frac{1}{2}}\min\{1,\sqrt{|S|}e^{-u}\}\,du\leq\begin{cases}\sqrt{2}\log|S|+\sqrt{2}&\mbox{if }\log|S|\geq cr,\\ \sqrt{2}&\mbox{if }cr\geq\log|S|,\end{cases}

where we have used $\int_{b}^{\infty}u^{-1/2}e^{-u}\,du\leq b^{-1/2}e^{-b}$ for $b>0$ and that $\frac{\sqrt{|S|\log|S|}}{\sqrt{cr/2}}\exp(-\frac{1}{2}cr)\leq\sqrt{2}$ whenever $cr\geq\log|S|$. For the second integral, we have

\log|S|\int_{cr/2}^{\infty}\frac{1}{u}\min\{1,\sqrt{|S|}e^{-u}\}\,du\leq\begin{cases}\log|S|\cdot\log\frac{\log|S|}{cr}+2&\mbox{if }\log|S|\geq cr,\\ 2&\mbox{if }cr\geq\log|S|,\end{cases}

where we have used that $\int_{b}^{\infty}\frac{1}{u}e^{-u}\,du\leq e^{-b}/b$ for any $b>0$. Replacing $r=r_{\star}/h_{\textup{op}}^{2}$, Theorem 1 follows.

2.4 Proof of Lemma 2.1

For the first statement, without loss of generality by scaling, we may assume $\tau^{2}=1$. Then for any $t_{0}\geq 0$, we may write

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]=\int_{0}^{\infty}\mathbb{P}\left(\max_{i\leq N}|X_{i}|\geq t^{1/k}\right)dt\leq t_{0}+2N\int_{t_{0}}^{\infty}\exp\left(-\frac{t^{2/k}}{2}\right)dt=t_{0}+2^{k/2}Nk\int_{t_{0}^{2/k}/2}^{\infty}u^{k/2-1}e^{-u}\,du,

where we have made the substitution $u=t^{2/k}/2$, or $t=2^{k/2}u^{k/2}$. Using [1, Eq. (1.5)], which states that $\int_{x}^{\infty}t^{a-1}e^{-t}\,dt\leq 2x^{a-1}e^{-x}$ for $x>a-1$, we obtain that

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]\leq t_{0}+2^{k/2+1}Nk\left(\frac{1}{2}t_{0}^{2/k}\right)^{k/2-1}\exp\left(-\frac{t_{0}^{2/k}}{2}\right)

whenever $t_{0}\geq k^{k/2}$. Take $t_{0}=\max\{(2\log N)^{k/2},k^{k/2}\}$ to achieve the result.

The second bound is more subtle. First, we note that $X_{i}/\tau$ is $(1,b/\tau)$-sub-exponential. We therefore prove the bound for $(1,b)$-sub-exponential random variables, rescaling at the end. Following a similar argument to that above, note that $\mathbb{P}(|X_{i}|\geq t)\leq 2\exp(-\min\{\frac{t}{2b},\frac{t^{2}}{2}\})$, and so

\mathbb{E}[|X_{i}|^{k}]\leq\int_{0}^{\infty}\mathbb{P}(|X_{i}|\geq t^{1/k})\,dt\leq 2\int_{0}^{\infty}\exp\left(-\min\left\{\frac{t^{1/k}}{2b},\frac{t^{2/k}}{2}\right\}\right)dt=2\int_{0}^{b^{-k}}\exp\left(-\frac{t^{2/k}}{2}\right)dt+2\int_{b^{-k}}^{\infty}\exp\left(-\frac{t^{1/k}}{2b}\right)dt.

Making the substitution $u=t^{2/k}/2$, or $dt=2^{k/2-1}ku^{k/2-1}\,du$, in the first integral, and $u=t^{1/k}/(2b)$, or $dt=(2b)^{k}ku^{k-1}\,du$, in the second, we obtain the bounds

\mathbb{E}[|X_{i}|^{k}]\leq 2^{k/2}k\int_{0}^{b^{-2}/2}u^{k/2-1}e^{-u}\,du+2^{k+1}b^{k}k\int_{b^{-2}/2}^{\infty}u^{k-1}e^{-u}\,du.

The first integral term always has upper bound $2^{k/2}k\Gamma(k/2)=2^{k/2+1}\Gamma(k/2+1)$, while the second has bound [1, Eq. (1.5)]

\int_{b^{-2}/2}^{\infty}u^{k-1}e^{-u}\,du\leq\begin{cases}2(b^{-2}/2)^{k-1}e^{-b^{-2}/2}&\mbox{if }b^{-2}\geq 4k,\\ \Gamma(k)&\mbox{otherwise}.\end{cases}

In the former case—when $b$ is small enough that $b\leq\frac{1}{2\sqrt{k}}$—we note that

kb^{k}(b^{-2})^{k-1}e^{-b^{-2}/2}=k\exp\left(-\frac{1}{2b^{2}}+(k-2)\log\frac{1}{b}\right)\leq k\exp\left(-2k+\frac{k}{2}\log k-\log k\right)\leq k^{k/2}.

Combining the preceding bounds therefore yields

\mathbb{E}[|X_{i}|^{k}]^{1/k}\lesssim\max\left\{\sqrt{k},\,bk\right\}. (11)

We leverage the moment bound (11) to give the bound on the maxima. Let $p\geq 1$ be arbitrary. Then

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]^{1/k}\leq\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{kp}\right]^{1/kp}\leq N^{\frac{1}{kp}}\max_{i\leq N}\mathbb{E}[|X_{i}|^{kp}]^{\frac{1}{kp}}\lesssim N^{\frac{1}{kp}}\max\left\{\sqrt{kp},\,bkp\right\}.

If $\log N\geq k$, take $p=\frac{1}{k}\log N$ to obtain that $\mathbb{E}[\max_{i\leq N}|X_{i}|^{k}]^{1/k}\lesssim\max\{\sqrt{\log N},b\log N\}$. Otherwise, take $p=1$, and note that $k\geq\log N$. To obtain the result with appropriate scaling, use the mapping $b\mapsto b/\tau$ to see that if the $X_{i}$ are $(\tau^{2},b)$-sub-exponential, then

\frac{1}{\tau}\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]^{1/k}\lesssim\max\left\{\sqrt{\log N},\,\frac{b}{\tau}\log N,\,\sqrt{k},\,\frac{bk}{\tau}\right\},

and multiply through by $\tau$.

2.5 Proof of Lemma 2.2

Note that $Z^{T}AZ=\frac{1}{2}Z^{T}(A+A^{T})Z$; we prove the result leveraging $B\coloneqq\frac{1}{2}(A+A^{T})$. As $B$ is symmetric, we can write $B=UDU^{T}$ for a diagonal matrix $D$ and orthogonal $U$, and as $Z\stackrel{\textup{dist}}{=}U^{T}Z$ we can further simplify (with no loss of generality) by assuming $B$ is diagonal with $B=\mathop{\rm diag}(b_{1},\ldots,b_{n})$. Then $Z^{T}BZ=\sum_{i=1}^{n}b_{i}Z_{i}^{2}$. As $\mathbb{E}[e^{\lambda Z_{i}^{2}}]=1/[1-2\lambda]_{+}^{1/2}$, we have

\mathbb{E}[\exp(\lambda Z^{T}BZ)]=\exp\left(-\frac{1}{2}\sum_{i=1}^{n}\log\left[1-2\lambda b_{i}\right]_{+}\right).

We use the Taylor approximation that if $\delta\leq\frac{1}{2}$, then $\log(1-\delta)\geq-\delta-\delta^{2}$, so

\mathbb{E}[\exp(\lambda(Z^{T}BZ-\mathop{\rm tr}(B)))]\leq\exp\left(\sum_{i=1}^{n}\left(\lambda b_{i}+2\lambda^{2}b_{i}^{2}\right)-\lambda\mathop{\rm tr}(B)\right)=\exp\left(2\lambda^{2}\mathop{\rm tr}(B^{2})\right)

whenever $0\leq\lambda\leq\frac{1}{4\left\|b\right\|_{\infty}}$. If $\lambda\leq 0$, an identical calculation holds when $\lambda\geq-\frac{1}{4\left\|b\right\|_{\infty}}$. As $\left\|b\right\|_{\infty}=|\!|\!|B|\!|\!|_{\rm op}=\frac{1}{2}|\!|\!|A+A^{T}|\!|\!|_{\rm op}$ and $4\mathop{\rm tr}(B^{2})=\mathop{\rm tr}((A+A^{T})^{2})$, this yields the first result of the lemma.

For the second, note that $|\!|\!|2B|\!|\!|_{\rm op}=|\!|\!|A+A^{T}|\!|\!|_{\rm op}\leq|\!|\!|A|\!|\!|_{\rm op}+|\!|\!|A^{T}|\!|\!|_{\rm op}=2|\!|\!|A|\!|\!|_{\rm op}$, while

\mathop{\rm tr}((A+A^{T})^{2})=\mathop{\rm tr}(AA)+\mathop{\rm tr}(AA^{T})+\mathop{\rm tr}(A^{T}A)+\mathop{\rm tr}(A^{T}A^{T})\leq 4\left\|A\right\|_{\rm Fr}^{2},

where we have used that $\langle C,D\rangle=\mathop{\rm tr}(C^{T}D)$ is an inner product on matrices and the Cauchy-Schwarz inequality.

References

  • Borwein and Chan [2009] J. M. Borwein and O.-Y. Chan. Uniform bounds for the incomplete complementary Gamma function. Mathematical Inequalities and Applications, 12:115–121, 2009.
  • Buja et al. [1989] A. Buja, T. Hastie, and R. Tibshirani. Linear smoothers and additive models. Annals of Statistics, 17(2):453–555, 1989.
  • Efron [2012] B. Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Institute of Mathematical Statistics Monographs. Cambridge University Press, 2012.
  • Tibshirani and Rosset [2019] R. J. Tibshirani and S. Rosset. Excess optimism: How biased is the apparent error of an estimator tuned by SURE? Journal of the American Statistical Association, 114(526):697–712, 2019.
  • van de Geer [2000] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
  • Wainwright [2019] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.