
A comment and erratum on “Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE?”

Maxime Cauchois, Alnur Ali, and John Duchi
Abstract

We identify and correct an error in the paper “Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE?” This correction allows new guarantees on the excess degrees of freedom—the bias in the error estimate of Stein’s unbiased risk estimate (SURE) for an estimator tuned by directly minimizing the SURE criterion—for arbitrary SURE-tuned linear estimators. Oracle inequalities follow as a consequence of these results for such estimators.

1 Introduction and setting

Tibshirani and Rosset [4] consider the Gaussian sequence model, where

Y=\theta_{0}+Z,\qquad Z\sim\mathsf{N}(0,\sigma^{2}I_{n}) (1)

for an unknown vector $\theta_{0}\in\mathbb{R}^{n}$. Their paper develops theoretical results on the excess optimism, or the amount of downward bias in Stein's Unbiased Risk Estimate (SURE), when using SURE to select an estimator. In their analysis of subset regression estimators [4, Section 4], there is an error in their calculation of this optimism, which we identify and address in this short note. In addition, we generalize the conclusions of their results, showing how the excess optimism bounds we develop (through their inspiration) extend to essentially arbitrary linear estimators. Their major conclusions and oracle inequalities for SURE-tuned subset regression estimators thus remain true and extend beyond projection estimators.

For any estimator $\widehat{\theta}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ of $\theta_{0}$, we define the risk

\textup{Risk}\big(\widehat{\theta}\,\big)\coloneqq\mathbb{E}\left[\|\widehat{\theta}(Y)-\theta_{0}\|_{2}^{2}\right].

We study estimation via linear smoothers [2], by which we mean simply that there exists a collection of matrices $\{H_{s}\}_{s\in S}\subset\mathbb{R}^{n\times n}$, indexed by $s\in S$, where $S$ is a finite set. Each of these induces an estimator $\widehat{\theta}_{s}(Y)\coloneqq H_{s}Y$ of $\theta_{0}$, with risk

R(s)\coloneqq\textup{Risk}(\widehat{\theta}_{s})=\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+\sigma^{2}\mathop{\rm tr}(H_{s}^{T}H_{s})=\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+\sigma^{2}\left\|H_{s}\right\|_{\rm Fr}^{2}.

We write

p_{s}\coloneqq\mathop{\rm tr}(H_{s})

for the (effective) degrees of freedom [3] of $\widehat{\theta}_{s}(Y)=H_{s}Y$, i.e., $\textup{df}(\widehat{\theta}_{s})=\frac{1}{\sigma^{2}}\sum_{i=1}^{n}\mathop{\rm Cov}(\widehat{Y}_{i},Y_{i})$ for $\widehat{Y}_{i}=[\widehat{\theta}_{s}(Y)]_{i}=[H_{s}Y]_{i}$. The familiar SURE risk criterion is then

\textup{SURE}(s)\coloneqq\left\|Y-H_{s}Y\right\|_{2}^{2}+2\sigma^{2}p_{s},

which satisfies

\mathbb{E}[\textup{SURE}(s)]=\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+\mathbb{E}\left[\left\|(I-H_{s})Z\right\|_{2}^{2}\right]+2\sigma^{2}p_{s}=R(s)+n\sigma^{2},

so that if $Y^{\prime}$ is an independent draw from the Gaussian sequence model (1), then SURE is an unbiased estimator of the (prediction) error $\mathbb{E}[\|\widehat{\theta}(Y)-Y^{\prime}\|_{2}^{2}]$.
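For concreteness, the following minimal Python sketch (our own illustration, not code from [4]) evaluates the SURE criterion for a finite family of smoother matrices and picks the minimizer by direct enumeration. The noise variance `sigma2` is assumed known, and the two smoothers in the demo are hypothetical stand-ins for $\{H_{s}\}_{s\in S}$.

```python
import numpy as np

def sure(Y, H, sigma2):
    # SURE(s) = ||Y - H_s Y||_2^2 + 2 sigma^2 tr(H_s)
    resid = Y - H @ Y
    return resid @ resid + 2.0 * sigma2 * np.trace(H)

def sure_select(Y, smoothers, sigma2):
    # Return the index s minimizing SURE(s) over the finite family.
    scores = {s: sure(Y, H, sigma2) for s, H in smoothers.items()}
    return min(scores, key=scores.get)

# Tiny demo with two hypothetical smoothers on n = 4 observations.
Y = np.array([1.0, 2.0, 0.5, 1.5])
smoothers = {"identity": np.eye(4), "mean": np.full((4, 4), 0.25)}
print(sure_select(Y, smoothers, sigma2=1.0))  # prints "mean" here
```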

Let $s_{0}\in S$ be the oracle choice of estimator,

s_{0}\coloneqq\mathop{\rm argmin}_{s\in S}\left\{\textup{Risk}(\widehat{\theta}_{s})\right\},

and let $\widehat{s}(Y)$ be the index minimizing the SURE criterion,

\widehat{s}(Y)\coloneqq\mathop{\rm argmin}_{s\in S}\left\{\textup{SURE}(s)=\left\|Y-H_{s}Y\right\|_{2}^{2}+2\sigma^{2}p_{s}\right\}.

Tibshirani and Rosset [4] study the excess risk that using $\widehat{s}(Y)$ in place of $s_{0}$ induces in the estimate $\widehat{\theta}_{\widehat{s}}$ of $\theta_{0}$, defining the excess optimism

\textup{ExOpt}(\widehat{\theta}_{\widehat{s}})\coloneqq\textup{Risk}(\widehat{\theta}_{\widehat{s}})+n\sigma^{2}-\mathbb{E}[\textup{SURE}(\widehat{s}(Y))]=\textup{Risk}(\widehat{\theta}_{\widehat{s}})+n\sigma^{2}-\mathbb{E}\left[\min_{s\in S}\textup{SURE}(s)\right],

which they conjecture is always nonnegative. Rearranging this definition and using that $\mathbb{E}[\min_{s\in S}\textup{SURE}(s)]\leq\min_{s\in S}\mathbb{E}[\textup{SURE}(s)]$ yields the risk upper bound (cf. [4, Thm. 1])

\textup{Risk}(\widehat{\theta}_{\widehat{s}})\leq\textup{Risk}(\widehat{\theta}_{s_{0}})+\textup{ExOpt}(\widehat{\theta}_{\widehat{s}}), (2)

so bounding $\textup{ExOpt}$ suffices to control $\textup{Risk}(\widehat{\theta}_{\widehat{s}})$. By a standard calculation [4, Sec. 1],

\textup{ExOpt}(\widehat{\theta}_{\widehat{s}})=\mathbb{E}\left[\|\theta_{0}-H_{\widehat{s}(Y)}Y\|_{2}^{2}\right]+n\sigma^{2}-\mathbb{E}\left[\|\theta_{0}+Z-H_{\widehat{s}(Y)}Y\|_{2}^{2}+2\sigma^{2}p_{\widehat{s}(Y)}\right]=2\,\mathbb{E}\left[(H_{\widehat{s}(Y)}Y-\theta_{0})^{T}Z-\sigma^{2}p_{\widehat{s}(Y)}\right].

Note that $Z=Y-\theta_{0}$ and $\mathbb{E}[Z]=0$, and define the estimated $Y_{i}$ by $\widehat{Y}_{i}=[H_{\widehat{s}(Y)}Y]_{i}$. Then following [4], we see that if we define the excess degrees of freedom

\textup{edf}(\widehat{\theta}_{\widehat{s}})\coloneqq\mathbb{E}\left[\frac{1}{\sigma^{2}}(H_{\widehat{s}(Y)}Y)^{T}(Y-\theta_{0})-p_{\widehat{s}(Y)}\right]=\bigg(\sum_{i=1}^{n}\frac{1}{\sigma^{2}}\mathop{\rm Cov}(\widehat{Y}_{i},Y_{i})\bigg)-\mathbb{E}[p_{\widehat{s}(Y)}], (3)

the gap between the effective degrees of freedom and the parameter “count,” then

\textup{ExOpt}(\widehat{\theta}_{\widehat{s}})=2\sigma^{2}\,\textup{edf}(\widehat{\theta}_{\widehat{s}}).

Tibshirani and Rosset [4] study bounds on this excess degrees-of-freedom quantity.
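To make the definition (3) concrete, the Monte Carlo sketch below (our own illustration, using a hypothetical family of coordinate projections) estimates $\textup{edf}(\widehat{\theta}_{\widehat{s}})$ by averaging over repeated draws of $Y=\theta_{0}+Z$, using that $\mathop{\rm Cov}(\widehat{Y}_{i},Y_{i})=\mathbb{E}[\widehat{Y}_{i}Z_{i}]$ since $\mathbb{E}[Z]=0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 30, 1.0, 2000
theta0 = np.linspace(0.0, 2.0, n)
# Hypothetical family: projections onto the first j coordinates, j = 1, ..., n.
smoothers = [np.diag((np.arange(n) < j).astype(float)) for j in range(1, n + 1)]
traces = np.array([np.trace(H) for H in smoothers])

cross_sum, p_sum = np.zeros(n), 0.0
for _ in range(reps):
    Z = rng.normal(scale=np.sqrt(sigma2), size=n)
    Y = theta0 + Z
    scores = [np.sum((Y - H @ Y) ** 2) + 2.0 * sigma2 * t
              for H, t in zip(smoothers, traces)]
    s_hat = int(np.argmin(scores))
    cross_sum += (smoothers[s_hat] @ Y) * Z  # accumulates Yhat_i * Z_i
    p_sum += traces[s_hat]                   # accumulates p_{s-hat}

# Monte Carlo version of (3): sum_i Cov(Yhat_i, Y_i)/sigma^2 - E[p_{s-hat}].
edf_hat = cross_sum.sum() / (sigma2 * reps) - p_sum / reps
print(f"estimated excess degrees of freedom: {edf_hat:.2f}")
```

One typically observes a positive estimate in such simulations, consistent with the conjectured nonnegativity of the excess optimism.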

The excess degrees of freedom and an error

By adding and subtracting $H_{\widehat{s}(Y)}\theta_{0}$ in the definition (3) of the excess degrees of freedom, we obtain the equality

\textup{edf}(\widehat{\theta}_{\widehat{s}})=\mathbb{E}\left[\frac{1}{\sigma^{2}}Z^{T}H_{\widehat{s}(Y)}Z-p_{\widehat{s}(Y)}\right]+\frac{1}{\sigma^{2}}\mathbb{E}\left[(H_{\widehat{s}(Y)}\theta_{0})^{T}Z\right]. (4)

This is a corrected version of [4, Equation (41)], as the referenced equation omits the second term. In this note, we provide bounds on this corrected edf quantity. The rough challenge is that, with a naive analysis, the second term in the expansion (4) may scale as $\frac{1}{\sigma}\left\|\theta_{0}\right\|_{2}\sqrt{\log|S|}$; we can prove that it is smaller, and the purpose of Theorem 1 in the next section is to provide a corrected bound on $\textup{edf}(\widehat{\theta}_{\widehat{s}})$.

Notation

We collect our mostly standard notation here. For functions $f,g:\mathcal{A}\to\mathbb{R}$, where $\mathcal{A}$ is any abstract index set, we write $f(a)\lesssim g(a)$ to indicate that there is a numerical constant $C<\infty$, independent of $a\in\mathcal{A}$, such that $f(a)\leq Cg(a)$ for all $a\in\mathcal{A}$. We let $\log_{+}t=\max\{0,\log t\}$ for $t>0$. The operator norm of a matrix $A$ is $|\!|\!|A|\!|\!|_{\rm op}=\sup_{\|u\|_{2}=1}\|Au\|_{2}$.

2 A bound on the excess degrees of freedom

We present a bound on the excess degrees of freedom $\textup{edf}(\widehat{\theta}_{\widehat{s}})$ of the SURE-tuned estimator $\widehat{\theta}_{\widehat{s}(Y)}=H_{\widehat{s}(Y)}Y$, where $\widehat{s}(Y)=\mathop{\rm argmin}_{s\in S}\textup{SURE}(s)$. In the theorem, we recall the oracle choice $s_{0}=\mathop{\rm argmin}_{s\in S}\textup{Risk}(\widehat{\theta}_{s})$.

Theorem 1.

Let $r_{\star}$ satisfy $\frac{1}{\sigma^{2}}R(s_{0})=\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\leq r_{\star}$. Assume additionally that $|\!|\!|H_{s}|\!|\!|_{\rm op}\leq h_{\textup{op}}$ for all $s\in S$, where $h_{\textup{op}}\geq 1$. Then

\textup{edf}(\widehat{\theta}_{\widehat{s}})\lesssim\sqrt{r_{\star}\log|S|}+h_{\textup{op}}\log|S|\cdot\left(1+\log_{+}\frac{h_{\textup{op}}^{2}\log|S|}{r_{\star}}\right).

We prove the theorem in Section 2.3. Here we provide a bit of commentary, first discussing subset regression estimators, as in the paper [4, Sec. 4], and then more general smoothers [2]. As we shall see, the assumption that the matrices $H_{s}$ have bounded operator norm loses little generality [2, Sec. 2.5].

2.1 Subset regression and projection estimators

We first revisit the setting that Tibshirani and Rosset [4, Sec. 4] consider, in which each matrix $H_{s}$ is an orthogonal projection. As a motivating example, in linear regression we may take $H_{s}=X_{s}(X_{s}^{T}X_{s})^{-1}X_{s}^{T}$, where $X_{s}$ denotes the submatrix of the design $X\in\mathbb{R}^{n\times p}$ whose columns $s$ indexes. In this case, as $H_{s}^{2}=H_{s}$, we have $|\!|\!|H_{s}|\!|\!|_{\rm op}\leq 1$ and $\mathop{\rm tr}(H_{s})=\left\|H_{s}\right\|_{\rm Fr}^{2}=p_{s}$. First consider the case that there is $s^{\star}$ satisfying $H_{s^{\star}}\theta_{0}=\theta_{0}$, that is, the true model belongs to the set $S$. Then because $s_{0}$ minimizes $\frac{1}{\sigma^{2}}\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+p_{s}$, we may take $r_{\star}=p_{s^{\star}}$, so that

\textup{edf}(\widehat{\theta}_{\widehat{s}})\lesssim\sqrt{p_{s^{\star}}\log|S|}+\log|S|\cdot\left(1+\log_{+}\frac{\log|S|}{p_{s^{\star}}}\right).

Whenever $p_{s^{\star}}\gtrsim\log|S|$, we have the simplified bound

\textup{edf}(\widehat{\theta}_{\widehat{s}})\lesssim\sqrt{p_{s^{\star}}\log|S|}.
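As a concrete illustration of this subset regression setting, the sketch below (our own, with a small hypothetical design) enumerates the projections $H_{s}=X_{s}(X_{s}^{T}X_{s})^{-1}X_{s}^{T}$ over all nonempty column subsets and tunes by SURE; here $p_{s}=\mathop{\rm tr}(H_{s})=|s|$.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 40, 5, 1.0
X = rng.normal(size=(n, p))                 # hypothetical design
theta0 = X[:, :2] @ np.array([1.0, -2.0])   # true mean uses columns {0, 1}
Y = theta0 + rng.normal(scale=np.sqrt(sigma2), size=n)

best, best_score = None, np.inf
for r in range(1, p + 1):
    for s in itertools.combinations(range(p), r):
        Xs = X[:, list(s)]
        # Orthogonal projection onto col(Xs); p_s = tr(H_s) = |s| here.
        H = Xs @ np.linalg.solve(Xs.T @ Xs, Xs.T)
        score = np.sum((Y - H @ Y) ** 2) + 2.0 * sigma2 * len(s)
        if score < best_score:
            best, best_score = s, score
print("SURE-selected subset:", best)
```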

An alternative way to look at the result is to replace $r_{\star}$ with $\frac{1}{\sigma^{2}}R(s_{0})=\frac{1}{\sigma^{2}}\min_{s\in S}R(s)$. In this case, combining [4, Thm. 1] with Theorem 1, we obtain

\textup{Risk}(\widehat{\theta}_{\widehat{s}})\leq R(s_{0})+C\left(\sqrt{R(s_{0})\,\sigma^{2}\log|S|}+\sigma^{2}\log|S|\cdot\left(1+\log_{+}\frac{\sigma^{2}\log|S|}{R(s_{0})}\right)\right)

for a (numerical) constant $C$. In particular, if $\min_{s\in S}\textup{Risk}(\widehat{\theta}_{s})\gtrsim\sigma^{2}\log|S|$, then we obtain that there exists a numerical constant $C$ such that for all $\eta>0$,

\textup{Risk}(\widehat{\theta}_{\widehat{s}})\leq(1+C\eta)\min_{s\in S}\textup{Risk}(\widehat{\theta}_{s})+C\frac{\sigma^{2}\log|S|}{\eta}. (5)

Additional perspective comes by considering some of the bounds Tibshirani and Rosset [4, Sec. 4] develop. While we cannot recover identical results because of the correction term in the expansion (4) of the excess degrees of freedom, we can recover analogues. As one example, consider their Theorem 2. It assumes that $n\to\infty$ in the sequence model (1), and (implicitly letting $S$ and $s_{0}$ depend on $n$) that $\textup{Risk}(\widehat{\theta}_{s_{0}})/\log|S|\to\infty$. Then inequality (5) immediately gives the oracle inequality $\textup{Risk}(\widehat{\theta}_{\widehat{s}})=(1+o(1))\textup{Risk}(\widehat{\theta}_{s_{0}})$, and Theorem 1 moreover shows that

\frac{\textup{edf}(\widehat{\theta}_{\widehat{s}})}{\frac{1}{\sigma^{2}}\textup{Risk}(\widehat{\theta}_{s_{0}})}\lesssim\sqrt{\frac{\sigma^{2}\log|S|}{\textup{Risk}(\widehat{\theta}_{s_{0}})}}\to 0.

In brief, Tibshirani and Rosset’s major conclusions on subset regression estimators—and projection estimators more generally—hold, but with a few tweaks.

2.2 Examples of more general smoothers

In more generality, the matrices $H_{s}\in\mathbb{R}^{n\times n}$ defining the estimators may be arbitrary. In common cases, however, they are reasonably well-behaved. Take as motivation nonparametric regression, where $Y_{i}=f(X_{i})+\varepsilon_{i}$, and we let $\theta_{i}=f(X_{i})$ for $i=1,\ldots,n$ and $X_{i}\in\mathcal{X}$, so that $\textup{Risk}(\widehat{\theta})$ measures the in-sample risk of an estimator $\widehat{\theta}$ of $[f(X_{i})]_{i=1}^{n}$. Here, standard estimators include kernel regression and local averaging [2, 6], both of which we touch on.

First consider kernel ridge regression (KRR). For a reproducing kernel $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, the (positive semidefinite) Gram matrix $G$ has entries $G_{ij}=K(X_{i},X_{j})$. Then for $\lambda\in\mathbb{R}_{+}$, we define $H_{\lambda}=(G+\lambda I)^{-1}G$, which is symmetric and positive semidefinite, and satisfies $|\!|\!|H_{\lambda}|\!|\!|_{\rm op}\leq 1$. Assume the collection $\Lambda\subset\mathbb{R}_{+}$ is finite and define the effective dimension $p_{\lambda}=\mathop{\rm tr}(G(G+\lambda I)^{-1})$, yielding $\textup{SURE}(\lambda)=\|Y-H_{\lambda}Y\|_{2}^{2}+2\sigma^{2}p_{\lambda}$. Then SURE-tuned KRR, with regularization $\widehat{\lambda}=\mathop{\rm argmin}_{\lambda\in\Lambda}\textup{SURE}(\lambda)$ and $\widehat{\theta}_{\widehat{\lambda}}=H_{\widehat{\lambda}}Y$, satisfies

\textup{edf}(\widehat{\theta}_{\widehat{\lambda}})\lesssim\sqrt{r_{\star}\log|\Lambda|}+\log|\Lambda|\cdot\left(1+\log_{+}\frac{\log|\Lambda|}{r_{\star}}\right),

where as usual, $r_{\star}=\min_{\lambda\in\Lambda}\{\frac{1}{\sigma^{2}}\left\|(I-H_{\lambda})\theta_{0}\right\|_{2}^{2}+\mathop{\rm tr}(G(G+\lambda I)^{-1})\}$.
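The following sketch (our own illustration; the Gaussian kernel and its bandwidth are arbitrary choices, not prescribed by the theory) carries out SURE-tuned KRR over a finite grid $\Lambda$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 60, 0.25
X = np.sort(rng.uniform(-1.0, 1.0, size=n))
theta0 = np.sin(np.pi * X)                  # hypothetical smooth truth
Y = theta0 + rng.normal(scale=np.sqrt(sigma2), size=n)

# Gaussian kernel Gram matrix with (arbitrary) bandwidth 0.2.
G = np.exp(-((X[:, None] - X[None, :]) ** 2) / (2 * 0.2**2))
lambdas = np.logspace(-3, 2, 30)            # finite grid Lambda

def sure_krr(lam):
    H = np.linalg.solve(G + lam * np.eye(n), G)  # H_lambda = (G + lam I)^{-1} G
    p_lam = np.trace(H)                          # effective dimension p_lambda
    return np.sum((Y - H @ Y) ** 2) + 2.0 * sigma2 * p_lam

lam_hat = min(lambdas, key=sure_krr)
print(f"SURE-selected lambda: {lam_hat:.4g}")
```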

A second example arises from $k$-nearest-neighbor ($k$-nn) estimators. We take $S=\{1,\ldots,n\}$ to index the number of nearest neighbors to average, and for $k\in S$ let $\mathcal{N}_{k}(i)$ denote the indices of the $k$ nearest neighbors $X_{j}$ to $X_{i}$ in $\{X_{1},\ldots,X_{n}\}$, so $\mathcal{N}_{k}^{-1}(i)=\{j\mid i\in\mathcal{N}_{k}(j)\}$ are the indices $j$ for which $X_{i}$ is a neighbor of $X_{j}$. Then the matrix $H_{k}\in\mathbb{R}^{n\times n}_{+}$ satisfies $[H_{k}]_{ij}=\frac{1}{k}1\{j\in\mathcal{N}_{k}(i)\}$, and we claim that $|\!|\!|H_{k}|\!|\!|_{\rm op}^{2}\leq\frac{1}{k}\max_{i}|\mathcal{N}_{k}^{-1}(i)|$. Indeed,

[H_{k}^{T}H_{k}]_{ij}=\frac{1}{k^{2}}\sum_{l=1}^{n}1\{i\in\mathcal{N}_{k}(l),\,j\in\mathcal{N}_{k}(l)\},

and as $H_{k}^{T}H_{k}$ is elementwise nonnegative, the Gershgorin circle theorem guarantees

|\!|\!|H_{k}|\!|\!|_{\rm op}^{2}\leq\max_{i\leq n}\sum_{j=1}^{n}[H_{k}^{T}H_{k}]_{ij}\leq\frac{1}{k^{2}}\max_{i\leq n}\sum_{l=1}^{n}\sum_{j=1}^{n}1\{i\in\mathcal{N}_{k}(l),\,j\in\mathcal{N}_{k}(l)\}\leq\frac{1}{k}\max_{i\leq n}\sum_{l=1}^{n}1\{i\in\mathcal{N}_{k}(l)\}=\frac{1}{k}\max_{i\leq n}|\mathcal{N}_{k}^{-1}(i)|

as $|\mathcal{N}_{k}(l)|\leq k$ for each $l$. Additionally, we have $\left\|H_{k}\right\|_{\rm Fr}^{2}=\frac{nk}{k^{2}}=\frac{n}{k}$. The normalized risk of the $k$-nn estimator is then $r_{k}=\frac{1}{\sigma^{2}}\left\|(I-H_{k})\theta_{0}\right\|_{2}^{2}+\frac{n}{k}$, and certainly $r_{k}\geq\log n$ whenever $k\leq\frac{n}{\log n}$. Under a few restrictions, we can therefore obtain an oracle-type inequality: assume the points $\{X_{1},\ldots,X_{n}\}$ are regular enough that $\max_{i}|\mathcal{N}_{k}^{-1}(i)|\lesssim k$ for $k\leq\frac{n}{\log n}$. Then the SURE-tuned $k$-nearest-neighbor estimator $\widehat{\theta}_{\widehat{k}}$ satisfies the bound

\textup{edf}(\widehat{\theta}_{\widehat{k}})\lesssim\sqrt{\frac{1}{\sigma^{2}}\min_{k\leq n/\log n}\textup{Risk}(\widehat{\theta}_{k})\cdot\log n}

on its excess degrees of freedom. Via inequality (2), this implies the oracle inequality

\textup{Risk}(\widehat{\theta}_{\widehat{k}})\leq\min_{k\leq n/\log n}\left(1+C\eta\right)\textup{Risk}(\widehat{\theta}_{k})+\frac{C\sigma^{2}\log n}{\eta}\qquad\mbox{for all }\eta>0.
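To see the operator-norm claim numerically, the sketch below (our own check, on hypothetical uniformly distributed points; we adopt the convention that a point is its own first neighbor) builds $H_{k}$ and compares $|\!|\!|H_{k}|\!|\!|_{\rm op}^{2}$ with the Gershgorin bound $\frac{1}{k}\max_{i}|\mathcal{N}_{k}^{-1}(i)|$:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.uniform(size=(n, 2))                # hypothetical design points
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def H_knn(k):
    # k-nn smoother: [H_k]_{ij} = (1/k) 1{j in N_k(i)}; rows sum to one.
    H = np.zeros((n, n))
    nbrs = np.argsort(D, axis=1)[:, :k]     # self has distance 0, so it is included
    for i in range(n):
        H[i, nbrs[i]] = 1.0 / k
    return H

for k in (5, 20, 50):
    H = H_knn(k)
    op2 = np.linalg.norm(H, 2) ** 2         # squared operator (spectral) norm
    in_deg = (H > 0).sum(axis=0).max()      # max_i |N_k^{-1}(i)|
    print(f"k={k:3d}  op_norm^2={op2:.3f}  bound={in_deg / k:.3f}")
```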

2.3 Proof of Theorem 1

Our proof strategy is familiar from high-dimensional statistics [cf. 6, Chs. 7 & 9]: we develop a basic inequality relating the risk of $\widehat{s}$ to that of $s_{0}$, then apply a peeling argument [5] to bound the probability that the error terms in this inequality deviate far from their expectations. Throughout, we let $C$ denote a universal (numerical) constant whose value may change from line to line.

To prove the theorem, we first recall a definition and state two auxiliary lemmas.

Definition 2.1.

A mean-zero random variable $X$ is $(\tau^{2},b)$-sub-exponential if $\mathbb{E}[e^{\lambda X}]\leq\exp(\frac{\lambda^{2}\tau^{2}}{2})$ for $|\lambda|\leq\frac{1}{b}$ (with the convention $\frac{1}{0}=\infty$). If $b=0$, then $X$ is $\tau^{2}$-sub-Gaussian.

Lemma 2.1.

Let $X_{i}$, $i=1,\ldots,N$, be $\tau^{2}$-sub-Gaussian and $k\geq 1$. Then

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]\leq 2\tau^{k}\max\left\{(2\log N)^{k/2},\,k^{k/2}\right\}.

Let $X_{i}$, $i=1,\ldots,N$, be $(\tau^{2},b)$-sub-exponential and $k\geq 1$. Then

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]^{1/k}\lesssim\max\left\{\sqrt{\tau^{2}\log N},\,b\log N,\,\sqrt{\tau^{2}k},\,bk\right\}.

The second statement of the lemma generalizes the first without specifying constants. We also use that quadratic forms of Gaussian random vectors are sub-exponential.

Lemma 2.2.

Let $Z\sim\mathsf{N}(0,I_{n\times n})$ and $A\in\mathbb{R}^{n\times n}$. Then $Z^{T}AZ-\mathop{\rm tr}(A)$ is $(\mathop{\rm tr}((A+A^{T})^{2}),\,2|\!|\!|A+A^{T}|\!|\!|_{\rm op})$-sub-exponential. In particular, $Z^{T}AZ-\mathop{\rm tr}(A)$ is $(4\left\|A\right\|_{\rm Fr}^{2},\,4|\!|\!|A|\!|\!|_{\rm op})$-sub-exponential.

We defer the proofs of Lemmas 2.1 and 2.2 to Sections 2.4 and 2.5, respectively.
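As a quick numerical sanity check on Lemma 2.2 (our own sketch, not part of the proof), one can compare an empirical moment generating function of a centered Gaussian quadratic form against the claimed sub-exponential bound:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
A = rng.normal(size=(n, n))
tau2 = np.trace((A + A.T) @ (A + A.T))   # tau^2 = tr((A + A^T)^2)
b = 2.0 * np.linalg.norm(A + A.T, 2)     # b = 2 |||A + A^T|||_op

Z = rng.normal(size=(200_000, n))
Q = np.einsum("ij,jk,ik->i", Z, A, Z) - np.trace(A)  # centered Z^T A Z
for lam in (0.25 / b, 0.5 / b, 1.0 / b):
    mgf = np.exp(lam * Q).mean()         # empirical E[exp(lambda * Q)]
    print(f"lambda={lam:.4f}: MGF={mgf:.4f}, bound={np.exp(lam**2 * tau2 / 2):.4f}")
```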

Recall the notation $R(s)=\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}+\sigma^{2}\left\|H_{s}\right\|_{\rm Fr}^{2}$, and for $s\in S$ define the centered variables

W_{s}\coloneqq\frac{1}{\sigma^{2}}Z^{T}(2H_{s}-H_{s}^{T}H_{s})Z+\mathop{\rm tr}(H_{s}^{T}H_{s})-2p_{s},\qquad Z_{s}\coloneqq\frac{1}{\sigma^{2}}\theta_{0}^{T}(H_{s}-I)^{T}(I-H_{s})Z. (6)

A quick calculation shows that for every sSs\in S,

\frac{1}{\sigma^{2}}\textup{SURE}(s)=\frac{1}{\sigma^{2}}R(s)+\frac{1}{\sigma^{2}}\left\|Z\right\|_{2}^{2}-W_{s}-2Z_{s}.

As $\widehat{s}(Y)$ minimizes the SURE criterion, we therefore have the basic inequality

\frac{1}{\sigma^{2}}\left(R(\widehat{s}(Y))-R(s_{0})\right)\leq W_{\widehat{s}(Y)}-W_{s_{0}}+2(Z_{\widehat{s}(Y)}-Z_{s_{0}}). (7)

We now provide a peeling argument using inequality (7). For each $l\in\mathbb{N}$ define the shell

\mathcal{S}_{l}\coloneqq\{s\in S\mid(2^{l}-1)\sigma^{2}r_{\star}\leq R(s)-R(s_{0})\leq(2^{l+1}-1)\sigma^{2}r_{\star}\}.

The key result is the following lemma.

Lemma 2.3.

There exists a numerical constant $c>0$ such that

\mathbb{P}(\widehat{s}(Y)\in\mathcal{S}_{l})\leq 2|\mathcal{S}_{l}|\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right).

Proof    Note that if $\widehat{s}(Y)\in\mathcal{S}_{l}$, we have

\max_{s\in\mathcal{S}_{l}}\left(W_{s}-W_{s_{0}}\right)+2\max_{s\in\mathcal{S}_{l}}(Z_{s}-Z_{s_{0}})\geq\max_{s\in\mathcal{S}_{l}}\left(W_{s}-W_{s_{0}}+2(Z_{s}-Z_{s_{0}})\right)\geq r_{\star}(2^{l}-1),

and so it must be the case that at least one of

\max_{s\in\mathcal{S}_{l}}(W_{s}-W_{s_{0}})\geq\frac{1}{2}r_{\star}(2^{l}-1)\qquad\mbox{or}\qquad\max_{s\in\mathcal{S}_{l}}(Z_{s}-Z_{s_{0}})\geq\frac{1}{4}r_{\star}(2^{l}-1) (8)

occurs. We can thus bound the probability that $\widehat{s}(Y)\in\mathcal{S}_{l}$ by bounding the probabilities of each of the events (8); to do this, we show that $W_{s}-W_{s_{0}}$ and $Z_{s}-Z_{s_{0}}$ concentrate for $s\in\mathcal{S}_{l}$.

As promised, we now show that $W_{s}-W_{s_{0}}$ and $Z_{s}-Z_{s_{0}}$ are sub-exponential and sub-Gaussian, respectively (recall Definition 2.1). Observe that

\frac{1}{\sigma^{2}}(R(s)-R(s_{0}))=\frac{1}{\sigma^{2}}\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}-\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s}\right\|_{\rm Fr}^{2}-\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\geq\left\|H_{s}\right\|_{\rm Fr}^{2}-r_{\star} (9)

by assumption that $r_{\star}\geq\frac{1}{\sigma^{2}}R(s_{0})=\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}$ and that $\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}\geq 0$. In particular, inequality (9) shows that for each $s\in\mathcal{S}_{l}$ we have $\left\|H_{s}\right\|_{\rm Fr}^{2}-r_{\star}\leq(2^{l+1}-1)r_{\star}$ (and we always have $\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\leq r_{\star}$), so that

\left\|H_{s}\right\|_{\rm Fr}^{2}\leq 2^{l+1}r_{\star}\quad\mbox{and}\quad\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\leq r_{\star}\qquad\mbox{for }s\in\mathcal{S}_{l}. (10)

For each ss we have

W_{s}-W_{s_{0}}=\frac{1}{\sigma^{2}}Z^{T}(2H_{s}-2H_{s_{0}}-H_{s}^{T}H_{s}+H_{s_{0}}^{T}H_{s_{0}})Z+\left\|H_{s}\right\|_{\rm Fr}^{2}-\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}-2(p_{s}-p_{s_{0}}),

while

\left\|2H_{s}-2H_{s_{0}}-H_{s}^{T}H_{s}+H_{s_{0}}^{T}H_{s_{0}}\right\|_{\rm Fr}^{2}\leq 4\left(4\left\|H_{s}\right\|_{\rm Fr}^{2}+4\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}+h_{\textup{op}}^{2}\left\|H_{s}\right\|_{\rm Fr}^{2}+h_{\textup{op}}^{2}\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\right)

by assumption that $|\!|\!|H_{s}|\!|\!|_{\rm op}\leq h_{\textup{op}}$ for all $s\in S$. Lemma 2.2 and the bounds (10) on $\left\|H_{s}\right\|_{\rm Fr}$ thus give that $W_{s}-W_{s_{0}}$ is $(C\cdot 2^{l}h_{\textup{op}}^{2}r_{\star},\,C\cdot h_{\textup{op}}^{2})$-sub-exponential, so that for $t\geq 0$, a Chernoff bound implies

\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(W_{s}-W_{s_{0}})\geq t\right)\leq|\mathcal{S}_{l}|\exp\left(C\cdot 2^{l}\lambda^{2}h_{\textup{op}}^{2}r_{\star}-\lambda t\right),

valid for $0\leq\lambda\leq\frac{1}{Ch_{\textup{op}}^{2}}$. Taking $t=\frac{1}{2}(2^{l}-1)r_{\star}$ and $\lambda=\frac{1}{C^{\prime}h_{\textup{op}}^{2}}$ yields that for a numerical constant $c>0$,

\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(W_{s}-W_{s_{0}})\geq\frac{1}{2}(2^{l}-1)r_{\star}\right)\leq|\mathcal{S}_{l}|\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right).

We can provide a similar bound on $Z_{s}-Z_{s_{0}}$ for $s\in\mathcal{S}_{l}$. It is immediate that $Z_{s}\sim\mathsf{N}(0,\frac{1}{\sigma^{2}}\left\|(H_{s}-I)^{T}(I-H_{s})\theta_{0}\right\|_{2}^{2})$. Using inequality (9) and that $\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}\leq r_{\star}$, for each $s\in\mathcal{S}_{l}$ we have

\frac{1}{\sigma^{2}}\left\|(I-H_{s})\theta_{0}\right\|_{2}^{2}=\frac{1}{\sigma^{2}}(R(s)-R(s_{0}))+\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}+\left\|H_{s_{0}}\right\|_{\rm Fr}^{2}-\left\|H_{s}\right\|_{\rm Fr}^{2}\leq r_{\star}(2^{l+1}-1)+r_{\star}-\left\|H_{s}\right\|_{\rm Fr}^{2}\leq 2^{l+1}r_{\star}.

Similarly, $Z_{s_{0}}\sim\mathsf{N}(0,\frac{1}{\sigma^{2}}\left\|(H_{s_{0}}-I)^{T}(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2})$ and $\frac{1}{\sigma^{2}}\left\|(I-H_{s_{0}})\theta_{0}\right\|_{2}^{2}\leq r_{\star}$. Using that $|\!|\!|I-H_{s}|\!|\!|_{\rm op}\leq 1+h_{\textup{op}}$, for each $s\in\mathcal{S}_{l}$ we have $Z_{s}-Z_{s_{0}}\sim\mathsf{N}(0,\tau^{2}(s))$ for some $\tau^{2}(s)\leq C\cdot h_{\textup{op}}^{2}2^{l}r_{\star}$. This yields the bound

\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(Z_{s}-Z_{s_{0}})\geq\frac{1}{4}r_{\star}(2^{l}-1)\right)\leq|\mathcal{S}_{l}|\exp\left(-c\,\frac{r_{\star}(2^{l}-1)^{2}}{2^{l}h_{\textup{op}}^{2}}\right)\leq|\mathcal{S}_{l}|\exp\left(-c^{\prime}\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right),

where $c,c^{\prime}>0$ are numerical constants. Returning to the events (8), we have shown

\mathbb{P}(\widehat{s}(Y)\in\mathcal{S}_{l})\leq\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(W_{s}-W_{s_{0}})\geq\frac{1}{2}(2^{l}-1)r_{\star}\right)+\mathbb{P}\left(\max_{s\in\mathcal{S}_{l}}(Z_{s}-Z_{s_{0}})\geq\frac{1}{4}r_{\star}(2^{l}-1)\right)\leq 2|\mathcal{S}_{l}|\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right)

as desired. ∎

We leverage the probability bound in Lemma 2.3 to give our final guarantees. Define the (centered) linear and quadratic terms

Q_{s}\coloneqq\frac{1}{\sigma^{2}}Z^{T}H_{s}Z-p_{s}\qquad\mbox{and}\qquad L_{s}\coloneqq\frac{1}{\sigma^{2}}Z^{T}H_{s}\theta_{0},

so that $\textup{edf}(\widehat{\theta}_{\widehat{s}})=\mathbb{E}[Q_{\widehat{s}(Y)}]+\mathbb{E}[L_{\widehat{s}(Y)}]$. Expanding this equality over the shells $\mathcal{S}_{l}$, we have

\textup{edf}(\widehat{\theta}_{\widehat{s}})=\sum_{l=0}^{\infty}\left(\mathbb{E}[Q_{\widehat{s}(Y)}1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}]+\mathbb{E}[L_{\widehat{s}(Y)}1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}]\right).

As in the proof of Lemma 2.3, the bounds (10) that $\left\|H_{s}\right\|_{\rm Fr}^{2}\leq 2^{l+1}r_{\star}$ and Lemma 2.2 guarantee that $Q_{s}$ is $(2^{l+3}r_{\star},4h_{\textup{op}})$-sub-exponential for $s\in\mathcal{S}_{l}$. Thus we have

\mathbb{E}\left[Q_{\widehat{s}(Y)}1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}\right]\leq\mathbb{E}\left[\max_{s\in\mathcal{S}_{l}}Q_{s}\,1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}\right]\stackrel{(i)}{\leq}\mathbb{E}\left[\max_{s\in\mathcal{S}_{l}}Q_{s}^{2}\right]^{1/2}\mathbb{P}(\widehat{s}(Y)\in\mathcal{S}_{l})^{1/2}\stackrel{(ii)}{\lesssim}\max\left\{\sqrt{2^{l}r_{\star}\log|\mathcal{S}_{l}|},\,h_{\textup{op}}\log|\mathcal{S}_{l}|\right\}\min\left\{1,\sqrt{|\mathcal{S}_{l}|}\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right)\right\},

where inequality $(i)$ is Cauchy-Schwarz and inequality $(ii)$ follows by combining Lemma 2.1 (take $k=2$) and Lemma 2.3. We similarly have that $L_{s}$ is $C\cdot 2^{l}r_{\star}$-sub-Gaussian, yielding

\mathbb{E}[L_{\widehat{s}(Y)}1\{\widehat{s}(Y)\in\mathcal{S}_{l}\}]\lesssim\sqrt{2^{l}r_{\star}\log|\mathcal{S}_{l}|}\,\min\left\{1,\sqrt{|\mathcal{S}_{l}|}\exp\left(-c\,\frac{2^{l}r_{\star}}{h_{\textup{op}}^{2}}\right)\right\}.

Temporarily introduce the shorthand $r=\frac{r_{\star}}{h_{\textup{op}}^{2}}$. Substituting these bounds into the expansion of $\textup{edf}(\widehat{\theta}_{\widehat{s}})$ above and naively bounding $|\mathcal{S}_{l}|\leq|S|$ yields

\textup{edf}(\widehat{\theta}_{\widehat{s}})\lesssim\sqrt{r_{\star}\log|S|}\int_{0}^{\infty}\sqrt{2^{t}\min\{1,|S|\exp(-c2^{t}r)\}}\,dt+h_{\textup{op}}\log|S|\int_{0}^{\infty}\sqrt{\min\{1,|S|\exp(-c2^{t}r)\}}\,dt+\sqrt{r_{\star}\log|S|}+h_{\textup{op}}\log|S|

\lesssim h_{\textup{op}}\sqrt{\log|S|}\int_{cr/2}^{\infty}u^{-\frac{1}{2}}\min\{1,\sqrt{|S|}e^{-u}\}\,du+h_{\textup{op}}\log|S|\int_{cr/2}^{\infty}\frac{1}{u}\min\{1,\sqrt{|S|}e^{-u}\}\,du+\sqrt{r_{\star}\log|S|}+h_{\textup{op}}\log|S|,

where we made the substitution $u=c2^{t-1}r$ (noting that $\sqrt{r_{\star}/r}=h_{\textup{op}}$). We break each of the integrals into the regions $\frac{1}{2}cr\leq u\leq\frac{1}{2}\max\{cr,\log|S|\}$ and $u\geq\frac{1}{2}\max\{cr,\log|S|\}$. Thus

\sqrt{\log|S|}\int_{cr/2}^{\infty}u^{-\frac{1}{2}}\min\{1,\sqrt{|S|}e^{-u}\}\,du\leq\begin{cases}\sqrt{2}\log|S|+\sqrt{2}&\mbox{if }\log|S|\geq cr,\\ \sqrt{2}&\mbox{if }cr\geq\log|S|,\end{cases}

where we have used $\int_{b}^{\infty}u^{-1/2}e^{-u}\,du\leq b^{-1/2}e^{-b}$ for $b>0$ and that $\frac{\sqrt{|S|\log|S|}}{\sqrt{cr/2}}\exp(-\frac{1}{2}cr)\leq\sqrt{2}$ whenever $cr\geq\log|S|$. For the second integral, we have

\log|S|\int_{cr/2}^{\infty}\frac{1}{u}\min\{1,\sqrt{|S|}e^{-u}\}\,du\leq\begin{cases}\log|S|\cdot\log\frac{\log|S|}{cr}+2&\mbox{if }\log|S|\geq cr,\\ 2&\mbox{if }cr\geq\log|S|,\end{cases}

where we have used that $\int_{b}^{\infty}\frac{1}{u}e^{-u}\,du\leq e^{-b}/b$ for any $b>0$. Replacing $r=r_{\star}/h_{\textup{op}}^{2}$, Theorem 1 follows.

2.4 Proof of Lemma 2.1

For the first statement, without loss of generality by scaling, we may assume $\tau^{2}=1$. Then for any $t_{0}\geq 0$, we may write

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]=\int_{0}^{\infty}\mathbb{P}\left(\max_{i\leq N}|X_{i}|\geq t^{1/k}\right)dt\leq t_{0}+2N\int_{t_{0}}^{\infty}\exp\left(-\frac{t^{2/k}}{2}\right)dt=t_{0}+2^{k/2}Nk\int_{t_{0}^{2/k}/2}^{\infty}u^{k/2-1}e^{-u}\,du,

where we have made the substitution $u=t^{2/k}/2$, or $t=2^{k/2}u^{k/2}$. Using [1, Eq. (1.5)], which states that $\int_{x}^{\infty}t^{a-1}e^{-t}\,dt\leq 2x^{a-1}e^{-x}$ for $x>a-1$, we obtain that

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]\leq t_{0}+2^{k/2+1}Nk\left(\frac{1}{2}t_{0}^{2/k}\right)^{k/2-1}\exp\left(-\frac{t_{0}^{2/k}}{2}\right)

whenever $t_{0}\geq k^{k/2}$. Take $t_{0}=\max\{(2\log N)^{k/2},k^{k/2}\}$ to achieve the result.

The second bound is more subtle. First, we note that $X_{i}/\tau$ is $(1,b/\tau)$-sub-exponential. We therefore prove the bound for $(1,b)$-sub-exponential random variables, rescaling at the end. Following a similar argument to that above, note that $\mathbb{P}(|X_{i}|\geq t)\leq 2\exp(-\min\{\frac{t}{2b},\frac{t^{2}}{2}\})$, and so

\mathbb{E}[|X_{i}|^{k}]\leq\int_{0}^{\infty}\mathbb{P}(|X_{i}|\geq t^{1/k})\,dt\leq 2\int_{0}^{\infty}\exp\left(-\min\left\{\frac{t^{1/k}}{2b},\frac{t^{2/k}}{2}\right\}\right)dt=2\int_{0}^{b^{-k}}\exp\left(-\frac{t^{2/k}}{2}\right)dt+2\int_{b^{-k}}^{\infty}\exp\left(-\frac{t^{1/k}}{2b}\right)dt.

Making the substitution $u=t^{2/k}/2$, or $dt=2^{k/2-1}ku^{k/2-1}\,du$, in the first integral, and $u=t^{1/k}/(2b)$, or $dt=(2b)^{k}ku^{k-1}\,du$, in the second, we obtain the bounds

\mathbb{E}[|X_{i}|^{k}]\leq 2^{k/2}k\int_{0}^{b^{-2}/2}u^{k/2-1}e^{-u}\,du+2^{k+1}b^{k}k\int_{b^{-2}/2}^{\infty}u^{k-1}e^{-u}\,du.

The first integral term always has upper bound $2^{k/2}k\Gamma(k/2)=2^{k/2+1}\Gamma(k/2+1)$, while the second has bound [1, Eq. (1.5)]

\int_{b^{-2}/2}^{\infty}u^{k-1}e^{-u}\,du\leq\begin{cases}2(b^{-2}/2)^{k-1}e^{-b^{-2}/2}&\mbox{if }b^{-2}\geq 4k,\\ \Gamma(k)&\mbox{otherwise}.\end{cases}

In the former case—when $b$ is small enough that $b\leq\frac{1}{2\sqrt{k}}$—we note that

kb^{k}(b^{-2})^{k-1}e^{-b^{-2}/2}=k\exp\left(-\frac{1}{2b^{2}}+(k-2)\log\frac{1}{b}\right)\leq k\exp\left(-2k+\frac{k}{2}\log k-\log k\right)\leq k^{k/2}.

Combining the preceding bounds therefore yields

\mathbb{E}[|X_{i}|^{k}]^{1/k}\lesssim\max\left\{\sqrt{k},\,bk\right\}. (11)

We leverage the moment bound (11) to give the bound on the maxima. Let $p\geq 1$ be arbitrary. Then

\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]^{1/k}\leq\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{kp}\right]^{1/kp}\leq N^{\frac{1}{kp}}\max_{i\leq N}\mathbb{E}[|X_{i}|^{kp}]^{\frac{1}{kp}}\lesssim N^{\frac{1}{kp}}\max\left\{\sqrt{kp},\,bkp\right\}.

If $\log N\geq k$, take $p=\frac{1}{k}\log N$ to obtain that $\mathbb{E}[\max_{i\leq N}|X_{i}|^{k}]^{1/k}\lesssim\max\{\sqrt{\log N},b\log N\}$. Otherwise, take $p=1$, and note that $k\geq\log N$. To obtain the result with appropriate scaling, use the mapping $b\mapsto b/\tau$ to see that if the $X_{i}$ are $(\tau^{2},b)$-sub-exponential, then

\frac{1}{\tau}\mathbb{E}\left[\max_{i\leq N}|X_{i}|^{k}\right]^{1/k}\lesssim\max\left\{\sqrt{\log N},\,\frac{b}{\tau}\log N,\,\sqrt{k},\,\frac{bk}{\tau}\right\},

and multiply through by $\tau$.

2.5 Proof of Lemma 2.2

Note that $Z^{T}AZ=\frac{1}{2}Z^{T}(A+A^{T})Z$; we prove the result leveraging $B\coloneqq\frac{1}{2}(A+A^{T})$. As $B$ is symmetric, we can write $B=UDU^{T}$ for a diagonal matrix $D$ and orthogonal $U$, and as $Z\stackrel{\textup{dist}}{=}U^{T}Z$ we can further simplify (with no loss of generality) by assuming $B$ is diagonal with $B=\mathop{\rm diag}(b_{1},\ldots,b_{n})$. Then $Z^{T}BZ=\sum_{i=1}^{n}b_{i}Z_{i}^{2}$. As $\mathbb{E}[e^{\lambda Z_{i}^{2}}]=1/[1-2\lambda]_{+}^{1/2}$, we have

\mathbb{E}[\exp(\lambda Z^{T}BZ)]=\exp\left(-\frac{1}{2}\sum_{i=1}^{n}\log\left[1-2\lambda b_{i}\right]_{+}\right).

We use the Taylor approximation that if $\delta\leq\frac{1}{2}$, then $\log(1-\delta)\geq-\delta-\delta^{2}$, so

\mathbb{E}[\exp(\lambda(Z^{T}BZ-\mathop{\rm tr}(B)))]\leq\exp\left(\sum_{i=1}^{n}\left(\lambda b_{i}+2\lambda^{2}b_{i}^{2}\right)-\lambda\mathop{\rm tr}(B)\right)=\exp\left(2\lambda^{2}\mathop{\rm tr}(B^{2})\right)

whenever $0\leq\lambda\leq\frac{1}{4\left\|b\right\|_{\infty}}$. If $\lambda\leq 0$, an identical calculation holds when $\lambda\geq-\frac{1}{4\left\|b\right\|_{\infty}}$. As $\left\|b\right\|_{\infty}=|\!|\!|B|\!|\!|_{\rm op}=\frac{1}{2}|\!|\!|A+A^{T}|\!|\!|_{\rm op}$ and $4\mathop{\rm tr}(B^{2})=\mathop{\rm tr}((A+A^{T})^{2})$, this yields the first result of the lemma.

For the second, note that $|\!|\!|2B|\!|\!|_{\rm op}=|\!|\!|A+A^{T}|\!|\!|_{\rm op}\leq|\!|\!|A|\!|\!|_{\rm op}+|\!|\!|A^{T}|\!|\!|_{\rm op}=2|\!|\!|A|\!|\!|_{\rm op}$, while

\mathop{\rm tr}((A+A^{T})^{2})=\mathop{\rm tr}(AA)+\mathop{\rm tr}(AA^{T})+\mathop{\rm tr}(A^{T}A)+\mathop{\rm tr}(A^{T}A^{T})\leq 4\left\|A\right\|_{\rm Fr}^{2},

where we have used that $\langle C,D\rangle=\mathop{\rm tr}(C^{T}D)$ is an inner product on matrices and the Cauchy-Schwarz inequality.

References

  • Borwein and Chan [2009] J. M. Borwein and O.-Y. Chan. Uniform bounds for the incomplete complementary Gamma function. Mathematical Inequalities and Applications, 12:115–121, 2009.
  • Buja et al. [1989] A. Buja, T. Hastie, and R. Tibshirani. Linear smoothers and additive models. Annals of Statistics, 17(2):453–555, 1989.
  • Efron [2012] B. Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Institute of Mathematical Statistics Monographs. Cambridge University Press, 2012.
  • Tibshirani and Rosset [2019] R. J. Tibshirani and S. Rosset. Excess optimism: How biased is the apparent error of an estimator tuned by SURE? Journal of the American Statistical Association, 114(526):697–712, 2019.
  • van de Geer [2000] S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.
  • Wainwright [2019] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press, 2019.