
Minimax Signal Detection in Sparse Additive Models

Subhodh Kotekal and Chao Gao
University of Chicago
Abstract

Sparse additive models are an attractive choice in circumstances calling for modelling flexibility in the face of high dimensionality. We study the signal detection problem and establish the minimax separation rate for the detection of a sparse additive signal. Our result is nonasymptotic and applicable to the general case where the univariate component functions belong to a generic reproducing kernel Hilbert space. Unlike the estimation theory, the minimax separation rate reveals a nontrivial interaction between sparsity and the choice of function space. We also investigate adaptation to sparsity and establish an adaptive testing rate for a generic function space; adaptation is possible in some spaces while others impose an unavoidable cost. Finally, adaptation to both sparsity and smoothness is studied in the setting of Sobolev space, and we correct some existing claims in the literature.

1 Introduction

In the interest of interpretability, computation, and circumventing the statistical curse of dimensionality plaguing high dimensional regression, structure is often assumed on the true regression function. Indeed, it might plausibly be argued that sparse linear regression is the distinguishing export of modern statistics. Despite its popularity, circumstances may call for more flexibility to capture nonlinear effects of the covariates. Striking a balance between flexibility and structure, Hastie and Tibshirani [20] proposed generalized additive models (GAMs) as a natural extension to the vaunted linear model. In a GAM, the regression function admits an additive decomposition of univariate (nonlinear) component functions. However, as in the linear model, the sample size must outpace the dimension for consistent estimation. Following modern statistical instinct, a sparse additive model is compelling [33, 38, 29, 50, 35, 39]. The regression function admits an additive decomposition of univariate functions for which only a small subset are nonzero; it is the combination of a GAM and sparsity.

To fix notation, consider the $p$-dimensional Gaussian white noise model

$$dY_{x}=f(x)\,dx+\frac{1}{\sqrt{n}}\,dB_{x} \quad (1)$$

for $x\in[0,1]^{p}$. Though it may be more faithful to practical data analysis to consider the nonparametric regression model $Y_{i}=f(X_{i})+\epsilon_{i}$ (e.g. as in [38]), the Gaussian white noise model is convenient as it avoids distracting technicalities while maintaining focus on the statistical essence. Indeed, the nonparametric statistics literature has a long history of studying the white noise model to understand theoretical limits, relying on well-known asymptotic equivalences [5, 40] which imply, under various conditions, that mathematical results obtained in one model can be ported over to the other model. As our focus is theoretical rather than practical, we follow in this tradition. The generalized additive model asserts the regression function is of the form $f(x)=\sum_{j=1}^{p}f_{j}(x_{j})$ with $f_{1},\ldots,f_{p}$ being univariate functions belonging to some function space $\mathcal{H}$. Likewise, the sparse additive model asserts

$$f(x)=\sum_{j\in S}f_{j}(x_{j})$$

for some unknown support set $S$ of size $s$ denoting the active covariates.

Most of the existing literature has addressed estimation of sparse additive models, primarily in the nonparametric regression setting and with $\mathcal{H}$ being a reproducing kernel Hilbert space. After a series of works [33, 29, 35, 39], Raskutti et al. [38] (see also [29]) established that a penalized $M$-estimator achieves the minimax estimation rate under various choices of the reproducing kernel Hilbert space $\mathcal{H}$. Yuan and Zhou [50] establish minimax estimation rates under a notion of approximate sparsity. As is typical in estimation theory, the powerful framework of empirical processes is brought to bear in their proofs. Some articles have also addressed generalizations of the sparse additive model. For example, the authors of [49] consider, among other structures, an additive signal $f(x)=\sum_{j=1}^{k}f_{j}(x)$ where each component function $f_{j}$ is actually a multivariate function depending on at most $s_{j}$ many coordinates and is $\alpha_{j}$-Hölder. The authors go on to derive minimax rates that handle heterogeneity in the smoothness indices and the sparsities of the coordinate functions; as a taste of their results, they show, in a particular regime and under some conditions, the rate $kn^{-\frac{2\alpha}{2\alpha+s}}+ks\log\left(\frac{p}{s}\right)/n$ in the special, homogeneous case where $s_{j}=s$ and $\alpha_{j}=\alpha$ for all $j$. Recently, the results of [3] show certain deep neural networks can achieve the minimax estimation rate for sparse $k$-way interaction models. The $k$-way interaction model is also known as nonparametric ANOVA. To give an example, the sparse $2$-way interaction model assumes $f(x)=\sum_{j\in S_{1}}f_{j}(x_{j})+\sum_{(k,l)\in S_{2}}f_{kl}(x_{k},x_{l})$ where the sets of active variables $S_{1}$ and interactions $S_{2}$ have small cardinalities. When the $f_{j}$ are $\beta_{1}$-Hölder and the $f_{kl}$ are $\beta_{2}$-Hölder, [3] establishes, under some conditions and up to factors logarithmic in $n$, the rate $s_{1}(n^{-\frac{2\beta_{1}}{2\beta_{1}+1}}+(\log p)/n)+s_{2}(n^{-\frac{2\beta_{2}}{2\beta_{2}+2}}+(\log p)/n)$.

The literature has much less to say on the problem of signal detection

$$H_{0}: f\equiv 0, \quad (2)$$
$$H_{1}: ||f||_{2}\geq\varepsilon \text{ and } f\in\mathcal{F}_{s} \quad (3)$$

where $\mathcal{F}_{s}$ is the class of sparse additive signals given by (4). Adopting a minimax perspective [6, 23, 22, 21, 24], the goal is to determine the smallest rate $\varepsilon$ as a function of the sparsity level $s$, the dimension $p$, the sample size $n$, and the function space $\mathcal{H}$ such that consistent detection of the alternative against the null is possible.

Though to a much lesser extent than the corresponding estimation theory, optimal testing rates have been established in various high dimensional settings other than sparse additive models. The most canonical setup, the Gaussian sequence model, is perhaps the most studied [24, 2, 41, 34, 30, 8, 14, 10, 7, 12, 18]. Optimal rates have also been established in linear regression [1, 26, 37] and other models [36, 13]. A common motif is that optimal testing rates exhibit different phenomena from optimal estimation rates.

Returning to (2)-(3), the only directly relevant works in the literature are Ingster and Lepski’s article [25] and a later article by Gayraud and Ingster [17]. Ingster and Lepski [25] consider a sparse multichannel model which, after a transformation to sequence space, is closely related to (2)-(3). Adopting an asymptotic setting and exclusively choosing $\mathcal{H}$ to be a Sobolev space, they establish asymptotic minimax separation rates. However, their results only address the regimes $p=O(s^{2})$ and $s=O(p^{1/2-\delta})$ for a constant $\delta\in(0,1/2)$. Their result does not precisely pin down the rate near the phase transition $s\asymp\sqrt{p}$. In essence, their testing procedure in the sparse regime applies a Bonferroni correction to a collection of $\chi^{2}$-type tests, one test per each of the $p$ coordinates. Thus, a gap in their rate near $\sqrt{p}$ is unsurprising. In the dense regime, a single $\chi^{2}$-type test is used, as is typical in the minimax testing literature. Ingster and Lepski [25] also address adaptation to sparsity as well as adaptation to both the sparsity and the smoothness. Gayraud and Ingster [17] consider sparse additive models rather than a sparse multichannel model but make the same choice of $\mathcal{H}$ and work in an asymptotic setup. They establish sharp constants for the sparse case $s=p^{1/2-\delta}$ via a Higher Criticism type testing statistic. Throughout the paper, we primarily compare our results to [25] as it was the first article to establish rates. Our results do not address the question of sharp constants.

Our paper’s main contributions are the following. First, adopting a nonasymptotic minimax testing framework as initiated in [2], we establish the nonasymptotic minimax separation rate for (2)-(3) for any configuration of the sparsity $s$, dimension $p$, sample size $n$, and a generic function space $\mathcal{H}$. Notably, we do not restrict ourselves to Sobolev (or Besov) space as in [25, 17]. The test procedure we analyze involves thresholded $\chi^{2}$ statistics, following a strategy employed in other sparse signal detection problems [10, 30, 8, 9, 34].

Our second contribution is to establish an adaptive testing rate for a generic function space. Typically, the sparsity level is unknown, and it is of practical interest to have a methodology which can accommodate a generic $\mathcal{H}$. Interestingly, some choices of $\mathcal{H}$ do not involve any cost of adaptation; that is, the minimax rate can be achieved without knowing the sparsity. Our rate’s incurred adaptation cost turns out to be a delicate function of $\mathcal{H}$, thus extending Ingster and Lepski’s focus on Sobolev spaces [25]. Even in the Sobolev case, our result extends their article; near the regime $s\asymp\sqrt{p}$, our test provides finer detail by incurring a cost involving only $\log\log\log p$ instead of the $\log\log p$ incurred by their test. In principle, our result can be used to reverse the process and find a space $\mathcal{H}$ for which this adaptive rate incurs a given adaptation cost.

Finally, adaptation to both sparsity and smoothness is studied in the context of Sobolev space. We identify an issue with and correct a claim made by [25].

1.1 Notation

The following notation will be used throughout the paper. For $p\in\mathbb{N}$, let $[p]:=\{1,\ldots,p\}$. For $a,b\in\mathbb{R}$, denote $a\vee b:=\max\{a,b\}$ and $a\wedge b:=\min\{a,b\}$. Denote $a\lesssim b$ to mean there exists a universal constant $C>0$ such that $a\leq Cb$. The expression $a\gtrsim b$ means $b\lesssim a$. Further, $a\asymp b$ means $a\lesssim b$ and $b\lesssim a$. The symbol $\langle\cdot,\cdot\rangle$ denotes the usual inner product in Euclidean space and $\langle\cdot,\cdot\rangle_{F}$ denotes the Frobenius inner product. The total variation distance between two probability measures $P$ and $Q$ on a measurable space $(\mathcal{X},\mathcal{A})$ is defined as $d_{TV}(P,Q):=\sup_{A\in\mathcal{A}}|P(A)-Q(A)|$. If $Q$ is absolutely continuous with respect to $P$, the $\chi^{2}$-divergence is defined as $\chi^{2}(Q\,||\,P):=\int_{\mathcal{X}}\left(\frac{dQ}{dP}-1\right)^{2}\,dP$. For a finite set $S$, the notation $|S|$ denotes its cardinality. Throughout, iterated logarithms will be used (e.g. expressions like $\log\log\log p$ and $\log\log(np)$). Without explicitly stating so, we will take such an expression to be equal to some universal constant if it would otherwise be less than one. For example, $\log\log\log p$ should be understood to be equal to a universal constant when $p<e^{e^{e}}$.

1.2 Setup

1.2.1 Reproducing Kernel Hilbert Space (RKHS)

Following the literature (e.g. [47]), $\mathcal{H}$ will denote a reproducing kernel Hilbert space (RKHS). Before discussing the main results, basic properties of RKHSs are reviewed [47]. Suppose $\mathcal{H}\subset L^{2}([0,1])$ is an RKHS with associated inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$. There exists a symmetric function $K:[0,1]\times[0,1]\to\mathbb{R}_{+}$ called a kernel such that for any $x\in[0,1]$ we have (1) the function $K(\cdot,x)\in\mathcal{H}$ and (2) for all $f\in\mathcal{H}$ we have $f(x)=\langle f,K(\cdot,x)\rangle_{\mathcal{H}}$. Mercer’s theorem (Theorem 12.20 in [47]) guarantees that the kernel $K$ admits an expansion in terms of eigenfunctions $\{\psi_{k}\}_{k=1}^{\infty}$, namely $K(x,x')=\sum_{k=1}^{\infty}\mu_{k}\psi_{k}(x)\psi_{k}(x')$. To give examples, the kernel $K(x,x')=1+\min\{x,x'\}$ defines the first-order Sobolev space with eigenvalue decay $\mu_{k}\asymp k^{-2}$, and the kernel $K(x,x')=\exp\left(-\frac{(x-x')^{2}}{2}\right)$ exhibits eigenvalue decay $\mu_{k}\asymp e^{-ck\log k}$ (see [47] for a more detailed review).

Without loss of generality, we order the eigenvalues $\mu_{1}\geq\mu_{2}\geq\cdots\geq 0$. The eigenfunctions $\{\psi_{k}\}_{k=1}^{\infty}$ are orthonormal in $L^{2}([0,1])$ under the usual $L^{2}$ inner product $\langle\cdot,\cdot\rangle_{L^{2}}$, and the inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ satisfies $\langle f,g\rangle_{\mathcal{H}}=\sum_{k=1}^{\infty}\mu_{k}^{-1}\langle f,\psi_{k}\rangle_{L^{2}}\langle g,\psi_{k}\rangle_{L^{2}}$ for $f,g\in\mathcal{H}$.
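To make the eigenvalue decay concrete, the following minimal Python sketch (our illustration; the discretization size is an arbitrary choice) numerically approximates the Mercer eigenvalues of the first-order Sobolev kernel $K(x,x')=1+\min\{x,x'\}$ mentioned above and displays the $\mu_{k}\asymp k^{-2}$ decay.

```python
import numpy as np

# Discretize [0, 1]; eigenvalues of K/m approximate the Mercer eigenvalues mu_k
# of the integral operator with kernel K(x, x') = 1 + min(x, x').
m = 1000
x = (np.arange(m) + 0.5) / m
K = 1.0 + np.minimum.outer(x, x)
mu = np.sort(np.linalg.eigvalsh(K / m))[::-1]

# The decay mu_k ~ k^{-2} shows up as mu_k * k^2 being roughly constant in k.
for k in [1, 2, 5, 10, 50, 100]:
    print(k, mu[k - 1], mu[k - 1] * k**2)
```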

1.2.2 Parameter space

The parameter space which will be used throughout the paper is defined in this section. Suppose $\mathcal{H}$ is an RKHS. Recall we are interested in sparse additive signals $f(x)=\sum_{j\in S}f_{j}(x_{j})$ for some sparsity pattern $S\subset[p]$. Following [17, 38], for convenience we assume $\int_{0}^{1}f_{j}(t)\,dt=0$ for all $j$. This assumption is mild and can be relaxed; further discussion can be found in Section 5.1. Letting $\mathbf{1}\in L^{2}([0,1])$ denote the constant function equal to one, consider that $\mathcal{H}_{0}:=\mathcal{H}\cap\operatorname{span}\{\mathbf{1}\}^{\perp}$ is a closed subspace of $\mathcal{H}$. Hence, $\mathcal{H}_{0}$ is also an RKHS. We will put aside $\mathcal{H}$ (along with its eigenfunctions and eigenvalues) and work only with $\mathcal{H}_{0}$. Let $\{\psi_{k}\}_{k=1}^{\infty}$ and $\{\mu_{k}\}_{k=1}^{\infty}$ denote its associated eigenfunctions and eigenvalues respectively. Following [2], we also assume $\mu_{1}=1$ for convenience; this can easily be relaxed.

For each subset $S\subset[p]$, define

$$\mathcal{H}_{S}:=\left\{f(x)=\sum_{j\in S}f_{j}(x_{j}):f_{j}\in\mathcal{H}_{0}\text{ and }||f_{j}||_{\mathcal{H}_{0}}\leq 1\text{ for all }j\in S\right\}.$$

The condition on the RKHS norm enforces regularity. Define the parameter space

$$\mathcal{F}_{s}:=\bigcup_{\begin{subarray}{c}S\subset[p],\\ |S|\leq s\end{subarray}}\mathcal{H}_{S} \quad (4)$$

for each $1\leq s\leq p$.

Following [25, 17], it is convenient to transform (1) from function space to sequence space. The tensor product $L^{2}([0,1])^{\otimes p}$ admits the orthonormal basis $\{\Phi_{\ell}\}_{\ell\in\mathbb{N}^{p}}$ with $\Phi_{\ell}(t)=\prod_{j=1}^{p}\psi_{\ell_{j}}(t_{j})$ for $t\in[0,1]^{p}$. For ease of notation, denote $\phi_{j,k}(t)=\psi_{k}(t_{j})$ for $k\in\mathbb{N}$ and $j\in[p]$. Define the random variables

$$X_{k,j}:=\int_{[0,1]^{p}}\phi_{j,k}(x)\,dY_{x}\sim N(\theta_{k,j},n^{-1}) \quad (5)$$

where $\theta_{k,j}=\int_{0}^{1}\psi_{k}(x)f_{j}(x)\,dx$. The assumption $\int_{0}^{1}f_{j}(t)\,dt=0$ for all $j$ is used here. Note by orthogonality that $\{X_{k,j}\}_{k\in\mathbb{N},j\in[p]}$ is a collection of independent random variables. The notation $\Theta\in\mathbb{R}^{\mathbb{N}\times p}$ will frequently be used to denote the full matrix of coefficients. The notation $\Theta_{j}\in\mathbb{R}^{\mathbb{N}}$ will also be used to denote the $j$th column of $\Theta$. For $f\in\mathcal{F}_{s}$, the corresponding set of coefficients is

$$\mathscr{T}_{s}:=\left\{\Theta\in\mathbb{R}^{\mathbb{N}\times p}:\sum_{j=1}^{p}\mathbbm{1}_{\{\Theta_{j}\neq 0\}}\leq s\text{ and }\sum_{k=1}^{\infty}\frac{\theta_{k,j}^{2}}{\mu_{k}}\leq 1\text{ for all }j\right\}. \quad (6)$$

Note the parameter spaces $\mathcal{F}_{s}$ and $\mathscr{T}_{s}$ are in correspondence, and we will frequently write $f$ and $\Theta$ freely in the same context without comment. The understanding is that $f$ is a function and $\Theta$ is its corresponding matrix of basis coefficients. The notation $E_{f},P_{f},E_{\Theta}$, and $P_{\Theta}$ will be used to denote expectations and probability measures with respect to the denoted parameters.
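As an illustration of the sequence-space observations (5) and the parameter space (6), here is a minimal Python simulation sketch; the dimensions, the Sobolev-type eigenvalues, and the choice to put all signal on the first basis coefficient are arbitrary illustrative choices on our part.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (arbitrary choices).
n, p, s, kmax = 500, 200, 5, 50

# Sobolev-type eigenvalues mu_k = k^{-2} (so mu_1 = 1, as assumed in the text).
mu = np.arange(1, kmax + 1, dtype=float) ** (-2.0)

# A sparse coefficient matrix Theta: s active columns, each putting its signal
# on the first coefficient, so sum_k theta_{k,j}^2 / mu_k = 0.25 <= 1.
Theta = np.zeros((kmax, p))
support = rng.choice(p, size=s, replace=False)
Theta[0, support] = 0.5

# Check membership in the ellipsoid constraint of (6).
assert np.all((Theta**2 / mu[:, None]).sum(axis=0) <= 1)

# Independent sequence-space observations X_{k,j} ~ N(theta_{k,j}, 1/n).
X = Theta + rng.normal(scale=1 / np.sqrt(n), size=(kmax, p))
```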

1.2.3 Problem

As described earlier, given an observation from (1), the signal detection problem (2)-(3) is of interest. The goal is to characterize the nonasymptotic minimax separation rate $\varepsilon_{\mathcal{H}}^{*}=\varepsilon_{\mathcal{H}}^{*}(p,s,n)$.

Definition 1.

We say $\varepsilon^{*}_{\mathcal{H}}=\varepsilon_{\mathcal{H}}^{*}(p,s,n)$ is the nonasymptotic minimax separation rate for the problem (2)-(3) if

  1. (i)

     for all $\eta\in(0,1)$, there exists $C_{\eta}>0$ depending only on $\eta$ such that for all $C>C_{\eta}$,

     $$\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\varepsilon_{\mathcal{H}}^{*}(p,s,n)\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\leq\eta,$$

  2. (ii)

     for all $\eta\in(0,1)$, there exists $c_{\eta}>0$ depending only on $\eta$ such that for all $0<c<c_{\eta}$,

     $$\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq c\varepsilon_{\mathcal{H}}^{*}(p,s,n)\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\geq 1-\eta.$$

The nonasymptotic minimax testing framework was initiated by Baraud [2] and characterizes (up to universal factors) the fundamental statistical limit of the detection problem. The framework is nonasymptotic in the sense that the conditions for the minimax separation rate hold for any configuration of the problem parameters.

Since $||f||_{2}^{2}=\sum_{j=1}^{p}\sum_{k=1}^{\infty}\theta_{k,j}^{2}$, the problem can be equivalently formulated in sequence space as

$$H_{0}: \Theta\equiv 0, \quad (7)$$
$$H_{1}: ||\Theta||_{F}\geq\varepsilon \text{ and } \Theta\in\mathscr{T}_{s}. \quad (8)$$

This testing problem, with the parameter space (6), is interesting in its own right outside the sparse additive model and RKHS context. Indeed, the testing problem (7)-(8) is essentially a group-sparse extension of the detection problem in ellipses considered in Baraud’s seminal article [2]. In fact, this interpretation was our initial motivation to study the detection problem. The connection to sparse additive models was a later consideration, similar to the way in which the later article [17] considers sparse additive models when building upon the earlier, fundamental work [25] dealing with a sparse multichannel (essentially group-sparse) model. Taking the perspective of a sequence problem has a long history in nonparametric regression [24, 25, 2, 15, 27, 48, 40, 5] due to not only its fundamental connections but also its advantage in distilling the problem to its essence and dispelling technical distractions. Our results can be read exclusively (and readers are encouraged to do so) in the context of the sequence problem (7)-(8).

2 Minimax rates

We begin by describing some high-level and natural intuition before informally stating our main result in Section 2.1. Section 2.2 contains the development of some key quantities. In Sections 2.3 and 2.4, we formally state minimax lower and upper bounds respectively. In Section 2.5, some special cases illustrating the general result are discussed.

2.1 A naive ansatz

A first instinct is to look to previous results for context in an attempt to make a conjecture regarding the optimal testing rate. To illustrate how this line of thinking might proceed, consider the classical Gaussian white noise model on the unit interval in one dimension,

$$dY_{x}=f(x)\,dx+\frac{1}{\sqrt{n}}\,dB_{x}$$

for $x\in[0,1]$. Assume $f$ lives inside the unit ball of a reproducing kernel Hilbert space $\mathcal{H}$ and thus admits a decomposition in the associated orthonormal basis with a coefficient vector $\theta\in\ell^{2}(\mathbb{N})$ living in an ellipsoid determined by the kernel’s eigenvalues $\mu_{1}\geq\mu_{2}\geq\cdots\geq 0$. The optimal rate of estimation is given by Birgé and Massart [4]

$$\epsilon_{\text{est}}^{2}\asymp\max_{\nu\in\mathbb{N}}\left\{\mu_{\nu}\wedge\frac{\nu}{n}\right\}.$$

Baraud [2] established that the minimax separation rate for the signal detection problem

$$H_{0}: f\equiv 0,$$
$$H_{1}: ||f||_{2}\geq\varepsilon \text{ and } ||f||_{\mathcal{H}}\leq 1$$

is given by

$$\epsilon_{\text{test}}^{2}\asymp\max_{\nu\in\mathbb{N}}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu}}{n}\right\}.$$

In both estimation and testing, the maximizer $\nu^{*}$ can be conceptualized as the correct truncation order. Specifically for testing, Baraud’s procedure [2], working in sequence space, rejects $H_{0}$ when $\sum_{k=1}^{\nu^{*}}X_{k}^{2}-\frac{\nu^{*}}{n}\gtrsim\frac{\sqrt{\nu^{*}}}{n}$. The data for $k>\nu^{*}$ are not used, and it is in this sense that $\nu^{*}$ is understood as a truncation level. To illustrate these results, the rates for Sobolev spaces with smoothness $\alpha$ are $\epsilon_{\text{est}}^{2}\asymp n^{-\frac{2\alpha}{2\alpha+1}}$ and $\epsilon_{\text{test}}^{2}\asymp n^{-\frac{4\alpha}{4\alpha+1}}$ since $\mu_{\nu}\asymp\nu^{-2\alpha}$. By now, these nonasymptotic results of [4, 2] are well known and canonical.
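For the reader's convenience, the balancing computations behind these Sobolev rates are recorded below (a standard calculation, added here as an annotation):

```latex
% Estimation: balance \mu_\nu = \nu^{-2\alpha} against \nu/n.
\nu^{-2\alpha} \asymp \frac{\nu}{n}
\iff \nu^{*} \asymp n^{\frac{1}{2\alpha+1}},
\qquad
\epsilon_{\text{est}}^{2} \asymp (\nu^{*})^{-2\alpha} \asymp n^{-\frac{2\alpha}{2\alpha+1}}.

% Testing: balance \mu_\nu = \nu^{-2\alpha} against \sqrt{\nu}/n.
\nu^{-2\alpha} \asymp \frac{\sqrt{\nu}}{n}
\iff \nu^{*} \asymp n^{\frac{2}{4\alpha+1}},
\qquad
\epsilon_{\text{test}}^{2} \asymp (\nu^{*})^{-2\alpha} \asymp n^{-\frac{4\alpha}{4\alpha+1}}.
```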

Moving to the setting of sparse additive signals in the model (1), Raskutti et al. [38] derive the nonasymptotic minimax rate of estimation

$$\frac{s\log p}{n}+s\epsilon_{\text{est}}^{2}.$$

While their upper bound holds for any choice of $\mathcal{H}$ (satisfying some mild conditions) and sparsity level $s$, they only obtain a matching lower bound when $s=o(p)$ and when the unit ball of $\mathcal{H}$ has metric entropy of logarithmic or polynomial scaling. This rate obtained by Raskutti et al. [38] is pleasing. The term $\frac{s\log p}{n}$ is quite intuitive given the sparsity in the parameter space. The term $s\epsilon_{\text{est}}^{2}$ is the natural rate for estimating $s$ many univariate functions in $\mathcal{H}$ as if the sparsity pattern were known. Notably, there is no interaction between the choice of $\mathcal{H}$ and the sparsity structure: the sparsity term $\frac{s\log p}{n}$ is independent of $\mathcal{H}$, and the estimation term $s\epsilon_{\text{est}}^{2}$ is dimension free.

One might intuit that this lack of interaction is a general phenomenon. Instinct may suggest that signal detection in sparse additive models is also just $s$ many instances of a univariate nonparametric detection problem plus the problem of detecting a signal which is nonzero on an unknown sparsity pattern. Framing it as two distinct problems, one might conjecture the optimal testing rate should be $s\epsilon_{\text{test}}^{2}$ plus the $s$-sparse detection rate. Collier et al. [10] provide a natural candidate for the $s$-sparse testing rate, namely $\frac{s\log\left(\frac{p}{s^{2}}\right)}{n}$ for $s<\sqrt{p}$ and $\frac{\sqrt{p}}{n}$ for $s\geq\sqrt{p}$.

However, this is not the case, as a quick glance at [25] falsifies the conjecture for the case of Sobolev $\mathcal{H}$. Though quick to dispel hopeful thinking, [25] expresses little about the interaction between sparsity and $\mathcal{H}$. Our result explicitly captures this nontrivial interaction for a generic $\mathcal{H}$. We show the minimax separation rate is given by

$$\varepsilon_{\mathcal{H}}^{*}(p,s,n)^{2}\asymp s\wedge\begin{cases}\frac{s\log\left(\frac{p}{s^{2}}\right)}{n}+s\cdot\max_{\nu\in\mathbb{N}}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{p}{s^{2}}\right)}}{n}\right\}&\text{if }s<\sqrt{p},\\ s\cdot\max_{\nu\in\mathbb{N}}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{p}{s^{2}}\right)}}{n}\right\}&\text{if }s\geq\sqrt{p}.\end{cases} \quad (9)$$

The rate bears some resemblance to the sparse testing rate of Collier et al. [10] and the nonparametric testing rate of Baraud [2], but the combination of the two is not a priori straightforward. At this point in the discussion, not enough has been developed to explain how the form of the rate arises. Various features of the rate will be commented on later in the paper.

2.2 Preliminaries

In this section, some key pieces are defined.

Definition 2.

Define $\Gamma_{\mathcal{H}}=\Gamma_{\mathcal{H}}(p,s,n)$ to be the quantity

$$\Gamma_{\mathcal{H}}:=\max_{\nu\in\mathbb{N}}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{p}{s^{2}}\right)}}{n}\right\}.$$

Note, since $\mu_{1}=1$, it follows that $\Gamma_{\mathcal{H}}\gtrsim\frac{1}{n}$. It is readily seen from (9) that there are two broad regimes to consider. When $n\lesssim\log(1+p/s^{2})$, we have $\varepsilon_{\mathcal{H}}^{*}(p,s,n)^{2}\asymp s$. In the regime $n\gtrsim\log(1+p/s^{2})$, the rate is more complicated. The first regime is really a trivial regime since any signal $f\in\mathcal{F}_{s}$ must satisfy $||f||_{2}^{2}\leq s$ by virtue of $\mu_{1}=1$. Therefore, the degenerate test which always accepts $H_{0}$ vacuously detects sparse additive signals with $||f||_{2}^{2}\geq 2s$. Hence, the upper bound $\varepsilon_{\mathcal{H}}^{*}(p,s,n)^{2}\lesssim s$ is trivially achieved. It turns out a matching lower bound can be proved, which establishes that the regime $n\lesssim\log(1+p/s^{2})$ is fundamentally trivial; see Section 2.3 for a formal statement.

More generally, the form (9) is useful when discussing lower bounds. A different form is more convenient when discussing upper bounds, and it is a form which is familiar in the context of [10, 11].

Definition 3.

Define $\nu_{\mathcal{H}}$ to be the smallest positive integer $\nu$ such that

$$\mu_{\nu}\leq\frac{\sqrt{\nu\log\left(1+\frac{p}{s^{2}}\right)}}{n}.$$

As the next lemma shows, $\nu_{\mathcal{H}}$ is essentially the solution to the maximization problem in the definition of $\Gamma_{\mathcal{H}}$. Drawing an analogy to the result of Baraud [2] described in Section 2.1, $\nu_{\mathcal{H}}$ can be conceptualized as the correct order of truncation accounting for the dimension and sparsity.

Lemma 1.

If $\log\left(1+\frac{p}{s^{2}}\right)\leq\frac{n}{2}$, then $\Gamma_{\mathcal{H}}\leq\frac{\sqrt{\nu_{\mathcal{H}}\log\left(1+\frac{p}{s^{2}}\right)}}{n}\leq\sqrt{2}\,\Gamma_{\mathcal{H}}$.

With Lemma 1, the testing rate can be expressed as

$$\varepsilon^{*}_{\mathcal{H}}(p,s,n)^{2}\asymp\begin{cases}\frac{s}{n}\log\left(1+\frac{p}{s^{2}}\right)+\frac{s}{n}\sqrt{\nu_{\mathcal{H}}\log\left(1+\frac{p}{s^{2}}\right)}&\text{if }s<\sqrt{p},\\ \frac{\sqrt{p\nu_{\mathcal{H}}}}{n}&\text{if }s\geq\sqrt{p}.\end{cases} \quad (10)$$

The condition $\log\left(1+\frac{p}{s^{2}}\right)\lesssim n$ in Lemma 1 is natural in light of the triviality which occurs when $n\lesssim\log\left(1+\frac{p}{s^{2}}\right)$.
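The quantities in Definitions 2 and 3 are easy to compute numerically. The following Python sketch (our illustration; the Sobolev-type eigenvalues and the problem sizes are arbitrary) computes $\nu_{\mathcal{H}}$ and $\Gamma_{\mathcal{H}}$ by brute force and checks the inequalities of Lemma 1.

```python
import numpy as np

def nu_H(mu, n, p, s):
    """Smallest nu with mu_nu <= sqrt(nu * log(1 + p/s^2)) / n (Definition 3)."""
    L = np.log(1 + p / s**2)
    nu = np.arange(1, len(mu) + 1)
    return int(nu[mu <= np.sqrt(nu * L) / n][0])

def Gamma_H(mu, n, p, s):
    """Gamma_H = max_nu min(mu_nu, sqrt(nu * log(1 + p/s^2)) / n) (Definition 2)."""
    L = np.log(1 + p / s**2)
    nu = np.arange(1, len(mu) + 1)
    return float(np.max(np.minimum(mu, np.sqrt(nu * L) / n)))

# Sobolev-type decay mu_k = k^{-2 alpha}; illustrative parameters only.
alpha, n, p, s = 1.0, 10_000, 1_000, 5
mu = np.arange(1, 10_001, dtype=float) ** (-2 * alpha)

g, v = Gamma_H(mu, n, p, s), nu_H(mu, n, p, s)
bound = np.sqrt(v * np.log(1 + p / s**2)) / n
print(g <= bound <= np.sqrt(2) * g)  # Lemma 1 (here log(1 + p/s^2) <= n/2 holds)
```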

2.3 Lower bound

In this section, a formal statement of a lower bound on the minimax separation rate is given. Define

$$\psi(p,s,n)^{2}:=\begin{cases}\frac{s\log\left(1+\frac{p}{s^{2}}\right)}{n}\vee s\Gamma_{\mathcal{H}}&\text{if }s<\sqrt{p},\\ s\Gamma_{\mathcal{H}}&\text{if }s\geq\sqrt{p}.\end{cases} \quad (11)$$

First, it is shown that the testing problem is trivial if $\log\left(\frac{p}{s^{2}}\right)\gtrsim n$.

Proposition 1 (Triviality).

Suppose $1\leq s\leq p$. Suppose $\kappa>0$ and $\log\left(1+\frac{p}{s^{2}}\right)\geq\kappa n$. If $\eta\in(0,1)$, then for any $0<c<1\wedge\sqrt{\kappa}\wedge\sqrt{\kappa\log\left(1+4\eta^{2}\right)}$ we have

$$\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq c\sqrt{s}\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\geq 1-\eta.$$

Proposition 1 asserts that in order to achieve small testing error, it is necessary that $||f||_{2}\geq C\sqrt{s}$ for a sufficiently large $C>0$. To see why Proposition 1 is a statement about triviality, observe that $\mathcal{F}_{s}\subset\left\{f\in L^{2}([0,1]^{p}):||f||_{2}\leq\sqrt{s}\right\}$. To reiterate plainly, all potential $f\in\mathcal{F}_{s}$ in the model (1) live inside the ball of radius $\sqrt{s}$. There are essentially no functions $f\in\mathcal{F}_{s}$ with detectable norm when $\log\left(\frac{p}{s^{2}}\right)\gtrsim n$, and so the problem is essentially trivial.

The lower bound construction is straightforward. Working in sequence space, consider an alternative hypothesis with a prior $\pi$ in which a draw $\Theta\sim\pi$ is obtained by drawing a size-$s$ subset $S\subset[p]$ uniformly at random and setting $\theta_{k,j}=\rho$ if $k=1$ and $j\in S$, and $\theta_{k,j}=0$ otherwise. The value of $\rho$ determines the separation between the null and alternative hypotheses since $||\Theta||_{F}^{2}=s\rho^{2}$. However, $\rho$ must respect some constraints to ensure $\pi$ is supported on the parameter space and that it is impossible to distinguish the null and alternative hypotheses with vanishing error. Observe this construction places us in a one-dimensional sparse sequence model (since $\theta_{k,j}=0$ for $k\geq 2$), which is precisely the setting of [10]. From their results, it is seen we must have $\rho^{2}\lesssim\log\left(1+\frac{p}{s^{2}}\right)/n$. To ensure $\pi$ is supported on the parameter space, we must have $\sum_{k=1}^{\infty}\theta_{k,j}^{2}/\mu_{k}\leq 1$ for all $j\in[p]$. Since $\mu_{1}=1$, it follows $\sum_{k=1}^{\infty}\theta_{k,j}^{2}/\mu_{k}=\theta_{1,j}^{2}\leq\rho^{2}$, and so the constraint $\rho\lesssim 1$ must be enforced. When $\log\left(1+p/s^{2}\right)\gtrsim n$, only the second condition $\rho\lesssim 1$ is binding, and so the largest separation we can achieve is $||\Theta||_{F}^{2}\asymp s$. Hence, the problem is trivial in this regime.

To ensure non-triviality, it will be assumed that $\log\left(1+\frac{p}{s^{2}}\right)\leq\frac{n}{2}$. The choice of the factor $1/2$ is only for convenience and is not essential. In fact, the condition $\log\left(1+\frac{p}{s^{2}}\right)<n$ would always suffice for our purposes, and the condition $\log\left(1+\frac{p}{s^{2}}\right)\leq n$ would also suffice for $n>1$. The following theorem establishes $\varepsilon_{\mathcal{H}}^{*}(p,s,n)^{2}\gtrsim\psi(p,s,n)^{2}$.

Theorem 1.

Suppose $1\leq s\leq p$ and $\log\left(1+\frac{p}{s^{2}}\right)\leq\frac{n}{2}$. If $\eta\in(0,1)$, then there exists $c_{\eta}>0$ depending only on $\eta$ such that for any $0<c<c_{\eta}$ we have

$$\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq c\psi(p,s,n)\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\geq 1-\eta$$

where $\psi$ is given by (11).

The lower bound is proved via Le Cam’s two point method (also known as the method of “fuzzy” hypotheses [45]), which is standard in the minimax hypothesis testing literature [24]. The prior distribution employed in the argument is a natural construction. Namely, the active set (i.e. the size-$s$ set of nonzero coordinate functions) is selected uniformly at random, and the corresponding nonzero coordinate functions are drawn from the usual univariate nonparametric prior employed in the literature [2, 24].

2.4 Upper bound

In this section, testing procedures are constructed and formal statements establishing rate-optimality are made. The form of the rate in (10) is a more convenient target, and should be kept in mind when reading the statements of our upper bounds.

2.4.1 Hard thresholding in the sparse regime

In this section, the sparse regime $s<\sqrt{p}$ is discussed. For any $d\in\mathbb{N}$ and $j\in[p]$, define $E_{j}(d)=n\sum_{k\leq d}X_{k,j}^{2}$, where the data $\{X_{k,j}\}_{k\in\mathbb{N},j\in[p]}$ are defined via the transformation to sequence space (5). For any $r\geq 0$, define

$$T_{r}(d):=\sum_{j=1}^{p}\left(E_{j}(d)-\alpha_{r}(d)\right)\mathbbm{1}_{\{E_{j}(d)\geq d+r^{2}\}} \quad (12)$$

where

$$\alpha_{r}(d):=\frac{E\left(||g||^{2}\mathbbm{1}_{\{||g||^{2}\geq d+r^{2}\}}\right)}{P\{||g||^{2}\geq d+r^{2}\}} \quad (13)$$

where $g\sim N(0,I_{d})$. Note $\alpha_{r}(d)$ is a conditional expectation under the null hypothesis. The random variable $T_{r}(d)$ will be used as a test statistic. Such a statistic was first defined by Collier et al. [10], and similar statistics have been successfully employed in other signal detection problems [9, 34, 30, 8]. However, all previous works in the literature have used this statistic only with $d=1$. For our problem, it is necessary to take growing $d$. Consequently, a more refined analysis is necessary, and essentially new phenomena appear in the upper bound, as we discuss later. As noted in Section 2.2, the quantity $\nu_{\mathcal{H}}$ can be conceptualized as the correct truncation order; that is, it turns out the correct choice is $d\asymp\nu_{\mathcal{H}}$. The choice of $r$ is more complicated, and in fact there are two separate regimes depending on the size of $d$. The regime in which $d\gtrsim\log\left(1+\frac{p}{s^{2}}\right)$ is referred to as the “bulk” regime. The complementary regime $d\lesssim\log\left(1+\frac{p}{s^{2}}\right)$ is referred to as the “tail” regime.
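The statistic (12) is simple to implement; a minimal Python sketch follows. The closed form for $\alpha_{r}(d)$ below is our simplification, using the identity $E[Z\mathbbm{1}_{\{Z\geq t\}}]=d\,P\{\chi^{2}_{d+2}\geq t\}$ for $Z\sim\chi^{2}_{d}$; the demo parameter values are arbitrary.

```python
import numpy as np
from scipy.stats import chi2

def alpha_r(d, r):
    """alpha_r(d) = E[||g||^2 | ||g||^2 >= d + r^2] for g ~ N(0, I_d), computed
    via the identity E[Z 1{Z >= t}] = d * P{chi2_{d+2} >= t} for Z ~ chi2_d."""
    t = d + r**2
    return d * chi2.sf(t, d + 2) / chi2.sf(t, d)

def T_r(X, n, d, r):
    """Thresholded chi-square statistic (12); X is a (kmax, p) array of X_{k,j}."""
    E = n * np.sum(X[:d] ** 2, axis=0)   # E_j(d) = n * sum_{k <= d} X_{k,j}^2
    keep = E >= d + r**2                 # hard threshold at d + r^2
    return float(np.sum((E - alpha_r(d, r)) * keep))

# Under H0 the entries of X are i.i.d. N(0, 1/n) and E[T_r(d)] = 0 by the
# choice of alpha_r(d); large values of T_r(d) indicate a signal.
rng = np.random.default_rng(1)
n, p, d, r = 500, 200, 8, 3.0
X0 = rng.normal(scale=1 / np.sqrt(n), size=(50, p))
print(T_r(X0, n, d, r))
```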

Proposition 2 (Bulk).

Set $d=\nu_{\mathcal{H}}\vee\lceil D\rceil$, where $D$ is the universal constant from Lemma 15. There exist universal positive constants $K_{1},K_{2}$, and $K_{3}$ such that the following holds. Suppose $1\leq s<\sqrt{p}$ and $\log\left(1+\frac{p}{s^{2}}\right)\leq K_{3}^{2}d$. Set

$$\tau(p,s,n)^{2}=\frac{s\sqrt{\nu_{\mathcal{H}}\log\left(1+\frac{p}{s^{2}}\right)}}{n}.$$

If $\eta\in(0,1)$, then there exists $C_{\eta}>0$ depending only on $\eta$ such that for all $C>C_{\eta}$, we have

$$P_{0}\left\{T_{r}(d)>CK_{1}n\tau(p,s,n)^{2}\right\}+\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}P_{f}\left\{T_{r}(d)\leq CK_{1}n\tau(p,s,n)^{2}\right\}\leq\eta$$

where $r=K_{2}\left(d\log\left(1+\frac{p}{s^{2}}\right)\right)^{1/4}$. Here, $T_{r}(d)$ is given by (12).

Proposition 3 (Tail).

Set $d=\nu_{\mathcal{H}}\vee\lceil D\rceil$, where $D$ is the universal constant from Lemma 11. Let $K_{3}$ denote the universal positive constant from Proposition 2. There exist universal positive constants $K_{1}$ and $K_{2}$ such that the following holds. Suppose $1\leq s<\sqrt{p}$ and $\log\left(1+\frac{p}{s^{2}}\right)\geq K_{3}^{2}d$. Set

$$\tau(p,s,n)^{2}=\frac{s\log\left(1+\frac{p}{s^{2}}\right)}{n}.$$

If $\eta\in(0,1)$, then there exists $C_{\eta}>0$ depending only on $\eta$ such that for all $C>C_{\eta}$, we have

$$P_{0}\left\{T_{r}(d)>CK_{1}n\tau(p,s,n)^{2}\right\}+\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}P_{f}\left\{T_{r}(d)\leq CK_{1}n\tau(p,s,n)^{2}\right\}\leq\eta$$

where $r=K_{2}\sqrt{\log\left(1+\frac{p}{s^{2}}\right)}$. Here, $T_{r}(d)$ is given by (12).

Propositions 2 and 3 thus imply that, for $s<\sqrt{p}$, the squared minimax separation rate is upper bounded by $s\log\left(1+\frac{p}{s^{2}}\right)/n+s\sqrt{\nu_{\mathcal{H}}\log\left(1+\frac{p}{s^{2}}\right)}/n$. By Lemma 1, it follows that

$$\varepsilon^{*}_{\mathcal{H}}(p,s,n)^{2}\lesssim\frac{s\log\left(1+\frac{p}{s^{2}}\right)}{n}+s\Gamma_{\mathcal{H}}$$

under the condition that $\log\left(1+\frac{p}{s^{2}}\right)\leq\frac{n}{2}$. As established by Proposition 1 in Section 2.3, a condition like this is essential to avoid triviality.

Propositions 2 and 3 reveal an interesting phase transition in the minimax separation rate at the point

$$\nu_{\mathcal{H}}\asymp\log\left(1+\frac{p}{s^{2}}\right).$$

This phase transition phenomenon is driven by the tail behavior of $\chi^{2}_{d}$. Consider that under the null distribution $E_{j}(d)\sim\chi^{2}_{d}$ for all $j$, and so the statistic $T_{r}(d)$ is the sum of $p$ independent thresholded $\chi^{2}_{d}$ random variables. By a well-known lemma of Laurent and Massart [31] (which also follows from Bernstein’s inequality up to constants), for any $u>0$ we have

$$P\left\{\chi^{2}_{d}-d\geq 2\sqrt{du}+2u\right\}\leq e^{-u}. \quad (14)$$

Roughly speaking, $\chi^{2}_{d}-d$ exhibits subgaussian-type deviation in the “bulk” $u\lesssim d$ and subexponential-type deviation in the “tail” $u\gtrsim d$. Consequently, $T_{r}(d)$ should be intuited as a sum of thresholded subgaussian random variables when $r\lesssim\sqrt{d}$, and as a sum of thresholded subexponential random variables when $r\gtrsim\sqrt{d}$.

Examining (14), in the “tail” $u\gtrsim d$ it follows $2\sqrt{du}+2u\asymp u$, which no longer exhibits dependence on $d$. Analogously, in Proposition 3 the threshold is taken as $r\gtrsim\sqrt{d}$, and so the resulting rate exhibits no dependence on $d$ and consequently no dependence on $\mathcal{H}$. On the other hand, in the “bulk” $u\lesssim d$ it follows $2\sqrt{du}+2u\asymp\sqrt{du}$. Analogously, in Proposition 2 the threshold is taken as $r\lesssim\sqrt{d}$, and so the resulting rate indeed depends on $d$ and thus on $\mathcal{H}$.
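This dichotomy can be recorded in a single display (our annotation, matching the thresholds in Propositions 2 and 3, with $r^{2}$ playing the role of the deviation $2\sqrt{du}+2u$):

```latex
2\sqrt{du} + 2u \asymp
\begin{cases}
\sqrt{du}, & u \lesssim d \iff r \lesssim \sqrt{d} \quad (\text{subgaussian bulk}), \\
u,         & u \gtrsim d \iff r \gtrsim \sqrt{d} \quad (\text{subexponential tail}).
\end{cases}
```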

2.4.2 $\chi^{2}$ tests in the dense regime

The situation is less complicated in the dense regime $s\geq\sqrt{p}$, as it suffices to use the $\chi^{2}$ testing statistic $T:=n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}X_{k,j}^{2}$.

Proposition 4.

Suppose $s\geq\sqrt{p}$. Set

$$\tau(p,s,n)^{2}=\frac{\sqrt{p\nu_{\mathcal{H}}}}{n}.$$

If $\eta\in(0,1)$, then there exists $C_{\eta}>0$ depending only on $\eta$ such that for all $C>C_{\eta}$, we have

$$P_{0}\left\{T>p\nu_{\mathcal{H}}+\frac{2}{\sqrt{\eta}}\sqrt{p\nu_{\mathcal{H}}}\right\}+\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}P_{f}\left\{T\leq p\nu_{\mathcal{H}}+\frac{2}{\sqrt{\eta}}\sqrt{p\nu_{\mathcal{H}}}\right\}\leq\eta.$$

As in the sparse case, Lemma 1 asserts that $\tau(p,s,n)^{2}\asymp s\Gamma_{\mathcal{H}}$, matching (10), provided that the condition $\log\left(1+\frac{p}{s^{2}}\right)\leq n/2$ is satisfied.
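A minimal Python sketch of the dense-regime test follows, with the rejection threshold taken directly from Proposition 4 (the array convention matches the sequence-space data (5); the parameter values are arbitrary).

```python
import numpy as np

def dense_chi2_test(X, n, nu, p, eta):
    """Proposition 4: reject H0 when T = n * sum_{j} sum_{k <= nu} X_{k,j}^2
    exceeds p * nu + (2 / sqrt(eta)) * sqrt(p * nu)."""
    T = n * np.sum(X[:nu] ** 2)
    return T > p * nu + (2 / np.sqrt(eta)) * np.sqrt(p * nu)

# Under H0, T ~ chi2 with p * nu degrees of freedom, so rejection is rare.
rng = np.random.default_rng(2)
n, p, nu, eta = 500, 200, 8, 0.05
X0 = rng.normal(scale=1 / np.sqrt(n), size=(50, p))
print(dense_chi2_test(X0, n, nu, p, eta))
```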

2.5 Special cases

Having formally stated lower and upper bounds for the minimax separation rate, it is informative to explore a number of special cases and witness a variety of phenomena. Throughout the illustrations it will be assumed that $\log\left(1+\frac{p}{s^{2}}\right)\leq n/2$, as discussed earlier.

2.5.1 Sobolev

Taking the case $\mu_{k}\asymp k^{-2\alpha}$ as emblematic of Sobolev space with smoothness $\alpha>0$, we obtain the minimax separation rate

$$\varepsilon_{\mathcal{H}}^{*}(p,s,n)^{2}\asymp\begin{cases}\frac{s\log\left(\frac{p}{s^{2}}\right)}{n}+s\left(\frac{n}{\sqrt{\log\left(\frac{p}{s^{2}}\right)}}\right)^{-\frac{4\alpha}{4\alpha+1}}&\text{if }s<\sqrt{p},\\ s\left(\frac{ns}{\sqrt{p}}\right)^{-\frac{4\alpha}{4\alpha+1}}&\text{if }s\geq\sqrt{p}.\end{cases}$$

It is useful to compare to the rates obtained by Ingster and Lepski (Theorem 2 in [25]) (see also [17]), although their choice of parameter space is slightly different from that considered in this paper. In the sparse case $s=O\left(p^{1/2-\delta}\right)$ for a constant $\delta\in(0,1/2]$, their rate is

$$\begin{cases}s\left(\frac{n}{\sqrt{\log p}}\right)^{-\frac{4\alpha}{4\alpha+1}}&\text{if }\log p\leq n^{\frac{1}{2\alpha+1}},\\ s\,\frac{\log p}{n}&\text{if }\log p>n^{\frac{1}{2\alpha+1}}.\end{cases}$$

In the dense regime $p=O(s^{2})$, their rate (Theorem 1 in [25]) is $s\left(\frac{ns}{\sqrt{p}}\right)^{-\frac{4\alpha}{4\alpha+1}}$. Quick algebra verifies that our rate indeed matches Ingster and Lepski’s rate in these sparsity regimes.

In the sparse regime, the strange-looking phase transition at $\log p\asymp n^{\frac{1}{2\alpha+1}}$ in their rate now has a coherent explanation in view of our result. The situation $\log p\lesssim n^{\frac{1}{2\alpha+1}}$ corresponds to the “bulk” regime, in which case $T_{r}(d)$ from (12) behaves like a sum of thresholded subgaussian random variables. On the other side, where $\log p\gtrsim n^{\frac{1}{2\alpha+1}}$, subexponential behavior is exhibited by $T_{r}(d)$. In fact, our result gives more detail beyond $s\lesssim p^{1/2-\delta}$. Assume only $s<\sqrt{p}$ (for example, $s=\frac{\sqrt{p}}{\log p}$ is allowed now). Then the phase transition between the “bulk” and “tail” regimes actually occurs at $\log\left(1+\frac{p}{s^{2}}\right)\asymp n^{\frac{1}{2\alpha+1}}$.

2.5.2 Finite dimension

Consider a finite dimensional situation, that is, $1=\mu_{1}=\mu_{2}=\cdots=\mu_{m}>\mu_{m+1}=\mu_{m+2}=\cdots=0$ for some positive integer $m$. Function spaces exhibiting this kind of structure include linear functions, finite polynomials, and generally RKHSs based on kernels with finite rank. If $m<\frac{n^{2}}{\log\left(1+\frac{p}{s^{2}}\right)}$, the minimax separation rate is

$$\varepsilon_{\mathcal{H}}^{*}(p,s,n)^{2}\asymp\begin{cases}\frac{s\log\left(\frac{p}{s^{2}}\right)}{n}+\frac{s\sqrt{m\log\left(\frac{p}{s^{2}}\right)}}{n}&\text{if }s<\sqrt{p},\\ \frac{\sqrt{pm}}{n}&\text{if }s\geq\sqrt{p}.\end{cases}$$

In the sparse regime, the phase transition between the bulk and tail regimes occurs at $\log\left(1+\frac{p}{s^{2}}\right)\asymp m$.

2.5.3 Exponential decay

As another example, consider exponential decay of the eigenvalues, $\mu_{k}=c_{1}e^{-c_{2}k^{\gamma}}$, where $c_{1},c_{2}$ are universal constants and $\gamma>0$. Such decay is a feature of RKHSs based on Gaussian kernels. The minimax separation rate is

$$\varepsilon_{\mathcal{H}}^{*}(p,s,n)^{2}\asymp\begin{cases}\frac{s\log\left(\frac{p}{s^{2}}\right)}{n}+s\log^{\frac{1}{2\gamma}}\left(\frac{n}{\sqrt{\log\left(\frac{p}{s^{2}}\right)}}\right)\cdot\frac{\sqrt{\log\left(\frac{p}{s^{2}}\right)}}{n}&\text{if }s<\sqrt{p},\\ \frac{\sqrt{p}}{n}\log^{\frac{1}{2\gamma}}\left(\frac{ns}{\sqrt{p}}\right)&\text{if }s\geq\sqrt{p}.\end{cases}$$

In the sparse regime, the phase transition between the bulk and tail regimes occurs at $\log^{\gamma}\left(1+\frac{p}{s^{2}}\right)\asymp\log n$. The minimax separation rate is quite close to the finite dimensional rate, which is sensible as RKHSs based on Gaussian kernels are known to be fairly “small” nonparametric function spaces [47].

3 Adaptation

Thus far the sparsity parameter $s$ has been assumed known, and the tests constructed make use of this information. In practice, the statistician is typically ignorant of the sparsity level, and so it is of interest to understand whether adaptive tests can be furnished. In this section, we will establish an adaptive testing rate which accommodates a generic $\mathcal{H}$, and it turns out to exhibit a cost for adaptation which depends delicately on the function space.

To the best of our knowledge, Spokoiny’s article was the first to demonstrate an unavoidable cost for adaptation in a signal detection problem [41]. Later work established unavoidable costs for adaptive testing across a variety of situations (see [24] and references therein). This early work largely focused on adapting to an unknown smoothness parameter in a univariate nonparametric regression setting. More recently, adaptation to sparsity in high dimensional models has been studied. In many problems, one can adapt to sparsity without cost in the rate (nor in the constant for some problems) [14, 34, 30, 26, 1, 18]. In the context of sparse additive models in Sobolev space, Ingster and Lepski [25] (see also [17]) consider adaptation, and we discuss their results in Section 3.4.

3.1 Preliminaries

To formally state our result, some slight generalizations of the concepts found in Section 2.2 are needed.

Definition 4.

For $a>0$, define $\nu_{\mathcal{H}}(s,a)$ to be the smallest positive integer $\nu$ satisfying

$$\mu_{\nu}\leq\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}.$$
Definition 5.

For $a>0$, define $\Gamma_{\mathcal{H}}(s,a)$ to be

$$\Gamma_{\mathcal{H}}(s,a):=\max_{\nu\in\mathbb{N}}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\right\}.$$

Note $\nu_{\mathcal{H}}(s,a)$ is decreasing in $a$ and $\Gamma_{\mathcal{H}}(s,a)$ is increasing in $a$. As discussed in Section 2.2, two different forms are useful when discussing the separation rate, and Lemma 1 facilitated that discussion. The following lemma is a slight generalization and has the same purpose.

Lemma 2.

Suppose $a>0$. If $\log\left(1+\frac{pa}{s^{2}}\right)\leq\frac{n}{2}$, then

$$\Gamma_{\mathcal{H}}(s,a)\leq\frac{\sqrt{\nu_{\mathcal{H}}(s,a)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\leq\sqrt{2}\,\Gamma_{\mathcal{H}}(s,a).$$

At a high level, the central issue with adaptation lies in the selection of an estimator’s or a test’s hyperparameter (such as the bandwidth or penalty). Typically, there is some optimal choice, but it requires knowledge of an unknown parameter (e.g. smoothness or sparsity) in order to pick it. In the current problem, the optimal choice of $\nu$ is unknown since the sparsity level $s$ is unknown.

In adaptive testing, the typical strategy is to fix a grid of different values, construct a test for each potential value of the hyperparameter, and for detection use the test which takes the maximum over this collection of tests. Typically, a geometric grid is chosen, and the logarithm of the grid’s cardinality reflects a cost paid by the testing procedure [41, 34, 25]. It turns out this high-level intuition holds for signal detection in sparse additive models, but the details of how to select the grid are not direct since a generic $\mathcal{H}$ must be accommodated. For $a>0$, define

$$\mathscr{V}_{a}:=\left\{2^{k}:k\in\mathbb{N}\cup\{0\}\text{ and }2^{k-1}<\nu_{\mathcal{H}}(s,a)\leq 2^{k}\text{ for some }s\in[p]\right\}.$$

Define

$$\mathscr{A}_{\mathcal{H}}:=\sup\left\{a\geq 1:\log(e|\mathscr{V}_{a}|)\geq a\right\} \quad (15)$$

and define

$$\mathscr{V}_{\mathcal{H}}:=\mathscr{V}_{\mathscr{A}_{\mathcal{H}}}. \quad (16)$$

The grid $\mathscr{V}_{\mathcal{H}}$ will be used. It is readily seen that $\mathscr{A}_{\mathcal{H}}$ is finite, as the crude bound $\mathscr{A}_{\mathcal{H}}\leq\log(ep)$ is immediate. The following lemma shows that $\mathscr{A}_{\mathcal{H}}$ is, in essence, a fixed point.

Lemma 3.

If $\log\left(1+p\mathscr{A}_{\mathcal{H}}\right)\leq\frac{n}{2}$, then $\mathscr{A}_{\mathcal{H}}\leq\log(e|\mathscr{V}_{\mathcal{H}}|)\leq 2\mathscr{A}_{\mathcal{H}}$.

It turns out an adaptive testing procedure can be constructed which pays a price determined by $\mathscr{A}_{\mathcal{H}}$ (equivalently, $\log(e|\mathscr{V}_{\mathcal{H}}|)$ up to constant factors). To elaborate, assume $\log\left(1+p\mathscr{A}_{\mathcal{H}}\right)\leq n/2$. We will construct an adaptive test which achieves

$$\begin{cases}\frac{s}{n}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)+s\Gamma_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})&\text{if }s<\sqrt{p\mathscr{A}_{\mathcal{H}}},\\ s\Gamma_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})&\text{if }s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}.\end{cases}$$

The form of this separation rate is very similar to the minimax rate, except that the phase transition has shifted to $s\asymp\sqrt{p\mathscr{A}_{\mathcal{H}}}$ and a cost involving $\mathscr{A}_{\mathcal{H}}$ is incurred. The value of $\mathscr{A}_{\mathcal{H}}$ can vary quite a bit with the choice of $\mathcal{H}$; a few examples are illustrated in Section 3.4.

Our choice of the geometric grid yielding the fixed-point characterization of Lemma 3 may not be immediately intuitive. It can be understood at a high level by noticing there are two competing factors at play. First, fix $a\geq 1$ and conceptualize it as a target cost from scanning over some grid. To achieve that target cost, we should use the truncation levels $\nu_{\mathcal{H}}(s,a)$ for each sparsity $s$. Now, consider that we will scan over the geometric grid $\mathscr{V}_{a}$ obtained from these truncation levels. If that geometric grid has a cardinality which would incur a cost less than $a$, then it would appear strange that we are scanning over a smaller grid to achieve a target cost associated with a larger grid. It is intuitive that we might instead aim for a lower target cost $a'<a$. But this changes our grid to $\mathscr{V}_{a'}$, which now has a different cardinality, and so the same logic applies to $a'$. A similar line of reasoning applies if the grid had happened to be too large. Therefore, we intuitively want to select $a$ such that $\mathscr{V}_{a}$ has the proper cardinality to impose the cost $a$ upon us. Hence, we are seeking a kind of fixed point in this sense.
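The fixed point $\mathscr{A}_{\mathcal{H}}$ can also be approximated numerically. The following Python sketch (our illustration; the eigenvalue sequence, the problem sizes, and the resolution of the scan over $a$ are all arbitrary choices) computes $\nu_{\mathcal{H}}(s,a)$ from Definition 4, forms the dyadic grid $\mathscr{V}_{a}$, and scans for the largest $a\geq 1$ with $\log(e|\mathscr{V}_{a}|)\geq a$, using the crude bound $\mathscr{A}_{\mathcal{H}}\leq\log(ep)$.

```python
import numpy as np

def nu_H_sa(mu, n, p, s, a):
    """Smallest nu with mu_nu <= sqrt(nu * log(1 + p*a/s^2)) / n (Definition 4)."""
    L = np.log(1 + p * a / s**2)
    nu = np.arange(1, len(mu) + 1)
    return int(nu[mu <= np.sqrt(nu * L) / n][0])

def grid_V(mu, n, p, a):
    """The dyadic grid V_a: powers 2^k bracketing nu_H(s, a) as s ranges over [p]."""
    ks = {int(np.ceil(np.log2(nu_H_sa(mu, n, p, s, a)))) for s in range(1, p + 1)}
    return {2**k for k in ks}

def A_H(mu, n, p, num=200):
    """Approximate A_H = sup{a >= 1 : log(e |V_a|) >= a} by scanning [1, log(ep)]."""
    best = 1.0
    for a in np.linspace(1.0, np.log(np.e * p), num):
        if np.log(np.e * len(grid_V(mu, n, p, a))) >= a:
            best = a
    return best

# Sobolev-type decay; illustrative parameters only.
mu = np.arange(1, 2_001, dtype=float) ** (-2.0)
print(A_H(mu, n=2_000, p=200))
```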

3.2 Lower bound

The formulation of a lower bound is more delicate than one might initially anticipate. To illustrate the subtlety, suppose we have a candidate adaptive rate $s\mapsto\epsilon(s)$. It is tempting to consider the following lower bound formulation: for any $\eta\in(0,1)$, there exists $c_{\eta}>0$ such that for any $0<c<c_{\eta}$,

$$\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\max_{1\leq s\leq p}\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq c\epsilon(s)\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\geq 1-\eta.$$

While extremely natural, this criterion has an issue. In particular, if there exists $\bar{s}$ such that $\epsilon(\bar{s})\asymp\varepsilon^{*}(\bar{s})$, where $s\mapsto\varepsilon^{*}(s)$ denotes the minimax rate, then the above criterion is trivially satisfied. One simply lower bounds the maximum over all $s$ with the choice $s=\bar{s}$ and then invokes the minimax lower bound. To exaggerate, the candidate rate with $\epsilon(1)=\varepsilon^{*}(1)$ and $\epsilon(s)=\infty$ for $s\geq 2$ satisfies the above criterion. Furthermore, from an upper bound perspective there is a trivial test which achieves this candidate rate. This lower bound criterion would thus force us to conclude that such an absurd candidate rate is an adaptive testing rate.

To avoid such absurdities, we use a different lower bound criterion. The key issue in the formulation is that, in the domain of maximization for the Type II error, one must not include any $s$ for which the candidate rate is of the same order as the minimax rate.

Definition 6.

Suppose $s\mapsto\varepsilon_{\text{adapt}}(p,s,n)$ is a candidate adaptive rate and $\left\{\mathcal{S}_{p,n}\right\}_{p,n\in\mathbb{N}}$ is a collection of sets satisfying $\mathcal{S}_{p,n}\subset[p]$ for all $p$ and $n$. We say $s\mapsto\varepsilon_{\text{adapt}}(p,s,n)$ satisfies the adaptive lower bound criterion with respect to $\left\{\mathcal{S}_{p,n}\right\}_{p,n\in\mathbb{N}}$ if the following two conditions hold.

  1. (i)

     For any $\eta\in(0,1)$, there exists $c_{\eta}>0$ depending only on $\eta$ such that for all $0<c<c_{\eta}$,

     $$\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\max_{s\in\mathcal{S}_{p,n}}\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq c\varepsilon_{\text{adapt}}(s)\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\geq 1-\eta$$

     for all $p$ and $n$.

  2. (ii)

     Either

     $$\max_{p,n\in\mathbb{N}}\min_{s\in\mathcal{S}_{p,n}}\frac{\varepsilon_{\text{adapt}}(p,s,n)}{\varepsilon^{*}(p,s,n)}=\infty \quad (17)$$

     or

     $$\max_{p,n\in\mathbb{N}}\max_{s\in[p]}\frac{\varepsilon_{\text{adapt}}(p,s,n)}{\varepsilon^{*}(p,s,n)}<\infty \quad (18)$$

     where $s\mapsto\varepsilon^{*}(p,s,n)$ denotes the minimax separation rate.

Intuitively, this criterion is nontrivial only when there are no sparsity levels in the chosen reference sets for which the candidate rate is of the same order as the minimax rate. More explicitly, there is no sequence $\{s_{p,n}\}_{p,n\in\mathbb{N}}$ with $s_{p,n}\in\mathcal{S}_{p,n}$ along which the candidate and minimax rates match (up to constants). This criterion avoids trivialities such as the one described earlier. In Definition 6, formalizing the notion of “order” requires speaking in terms of sequences, though it may appear unfamiliar and clunky at first glance. Definition 6 is not new, but rather a direct port to the testing context of the notion of an adaptive rate of convergence on the scale of classes from [44] in the estimation setting (see also [11] for an application of this notion to linear functional estimation in the sparse Gaussian sequence model).

The condition (18) simply requires the candidate rate to match the minimax rate if (17) does not hold. In this way, the definition allows for the possibility that the minimax rate is itself an adaptive rate (in those cases where no cost for adaptation needs to be paid).

We now state a lower bound with respect to this criterion. Define the candidate rate

$$\psi_{\text{adapt}}(p,s,n)^{2}:=\begin{cases}\frac{s}{n}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\vee s\Gamma_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})&\text{if }s<\sqrt{p\mathscr{A}_{\mathcal{H}}},\\ s\Gamma_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})&\text{if }s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}.\end{cases} \quad (19)$$

Here, $\mathscr{A}_{\mathcal{H}}$ is defined in (15). Examining this candidate, it is possible one may run into absurdities for $s<(p\mathscr{A}_{\mathcal{H}})^{1/2-\delta}$ since $\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\asymp\log p$, meaning the candidate rate can match the minimax rate. Note here we have used that $\mathscr{A}_{\mathcal{H}}$ grows at most logarithmically in $p$. Therefore, we will take $\mathcal{S}=\left\{s\in[p]:s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}\right\}$ (dropping the subscripts and writing $\mathcal{S}=\mathcal{S}_{p,n}$ for notational ease).

It follows from the fact that $\Gamma_{\mathcal{H}}(s,a)$ is increasing in $a$ that either (17) or (18) should typically hold. To see that $\Gamma_{\mathcal{H}}(s,a)$ is indeed increasing in $a$, fix $a\leq a'$. Consider that for every $\nu\in\mathbb{N}$ we have

$$\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\leq\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{pa'}{s^{2}}\right)}}{n}.$$

Taking the maximum over $\nu$ on both sides yields $\Gamma_{\mathcal{H}}(s,a)\leq\Gamma_{\mathcal{H}}(s,a')$, i.e. $\Gamma_{\mathcal{H}}$ is increasing in $a$. Now, consider $a=1$ and $a'=\mathscr{A}_{\mathcal{H}}$ in order to compare the candidate rate $\psi_{\text{adapt}}$ to the minimax rate $\varepsilon^{*}$. Since $s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}$, we have $\log\left(1+\frac{p}{s^{2}}\right)\asymp\frac{p}{s^{2}}$ and $\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\asymp\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}$. With the above display, it is intuitive that either (17) or (18) should hold in typical nonpathological cases; these conditions can be easily checked on a problem-by-problem basis (such as those in Section 3.4).

In preparation for the statement of the lower bound, the following set is needed

𝒱~:={2k:k{0} and 2k1<ν(s,𝒜)2k for some s𝒮}.\widetilde{\mathscr{V}}_{\mathcal{H}}:=\left\{2^{k}:k\in\mathbb{N}\cup\left\{0\right\}\text{ and }2^{k-1}<\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})\leq 2^{k}\text{ for some }s\in\mathcal{S}\right\}. (20)

With 𝒱~\widetilde{\mathscr{V}}_{\mathcal{H}} defined, we are ready to state the lower bound. The following result establishes condition (i) of Definition 6 for ψadapt\psi_{\text{adapt}} given by (19) with respect to our choice 𝒮={s[p]:sp𝒜}\mathcal{S}=\left\{s\in[p]:s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}\right\}.

Theorem 2.

Suppose 𝒜log(e|𝒱~|)\mathscr{A}_{\mathcal{H}}\asymp\log\left(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|\right). Assume further log(1+p𝒜)n2\log\left(1+p\mathscr{A}_{\mathcal{H}}\right)\leq\frac{n}{2}. If η(0,1)\eta\in\left(0,1\right), then there exists cη>0c_{\eta}>0 depending only on η\eta such that for all 0<c<cη0<c<c_{\eta}, we have

infφ{P0{φ0}+maxs𝒮supfs,f2cψadaptPf{φ1}}1η\displaystyle\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\max_{s\in\mathcal{S}}\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq c\psi_{\text{adapt}}\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\geq 1-\eta

where ψadapt\psi_{\text{adapt}} is given by (19) and 𝒮={s[p]:sp𝒜}\mathcal{S}=\left\{s\in[p]:s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}\right\}.

The condition 𝒜log(e|𝒱~|)\mathscr{A}_{\mathcal{H}}\asymp\log\left(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|\right) is somewhat mild; for example, it is satisfied when the eigenvalues exhibit polynomial decay such as μk=cαk2α\mu_{k}=c_{\alpha}k^{-2\alpha} for some α>0\alpha>0 and positive universal constant cαc_{\alpha}. It is also satisfied when there is exponential decay such as μk=c1ec2kγ\mu_{k}=c_{1}e^{-c_{2}k^{\gamma}} for some γ>0\gamma>0 and positive universal constants c1c_{1} and c2c_{2}. Section 3.4 contains further discussion of these two cases.

The proof strategy is, at a high level, the same as that in Section 2.3, with the added complication that the prior should also randomize over the sparsity level in order to capture the added difficulty of not knowing it. The random sparsity level is generated by drawing a random truncation point ν\nu and extracting the induced sparsity. The rest of the prior construction is similar to that in Section 2.3, but the analysis is complicated by the random sparsity level.

Finally, note Theorem 2 is a statement about a cost paid for adaptation for sparsity levels in 𝒮={s[p]:sp𝒜}\mathcal{S}=\left\{s\in[p]:s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}\right\}. Nothing is asserted about smaller sparsities. It would be interesting to show whether or not the candidate ψadapt\psi_{\text{adapt}} also captures the cost for sparsity levels ss which satisfy (p𝒜)1/2δ<s<p𝒜(p\mathscr{A}_{\mathcal{H}})^{1/2-\delta}<s<\sqrt{p\mathscr{A}_{\mathcal{H}}} for all δ(0,1/2]\delta\in(0,1/2] (e.g. sp𝒜/log(p𝒜)s\asymp\sqrt{p\mathscr{A}_{\mathcal{H}}}/\log(p\mathscr{A}_{\mathcal{H}})); we leave it open for future work.

3.3 Upper bound

In this section, an adaptive test is constructed. Recall 𝒜\mathscr{A}_{\mathcal{H}} and 𝒱\mathscr{V}_{\mathcal{H}} given in (15) and (16) respectively. Define

τadapt2(p,s,n)={(snlog(1+p𝒜s2))(snν(s,𝒜)log(1+p𝒜s2))if s<p𝒜,snν(s,𝒜)log(1+p𝒜s2)if sp𝒜.\tau_{\text{adapt}}^{2}(p,s,n)=\begin{cases}\left(\frac{s}{n}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right)\vee\left(\frac{s}{n}\sqrt{\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}\right)&\textit{if }s<\sqrt{p\mathscr{A}_{\mathcal{H}}},\\ \frac{s}{n}\sqrt{\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}&\textit{if }s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}.\end{cases} (21)

In the test, the following geometric grid of sparsity levels is used. Define

𝒮:={1,2,4,,2log2p𝒜1}{p}.\mathscr{S}:=\left\{1,2,4,...,2^{\left\lceil\log_{2}\sqrt{p\mathscr{A}_{\mathcal{H}}}\right\rceil-1}\right\}\cup\{p\}. (22)

The choice to take 𝒮\mathscr{S} as a geometric grid is not statistically critical, but doing so is computationally advantageous.
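As a small illustration, the grid (22) can be constructed directly; the value passed for 𝒜\mathscr{A}_{\mathcal{H}} below is a hypothetical stand-in, since in practice it would be computed from the eigenvalue sequence via (15).

```python
import numpy as np

def sparsity_grid(p, A):
    # Grid (22): powers of two 1, 2, 4, ..., 2^(ceil(log2(sqrt(p*A))) - 1),
    # together with p itself; A plays the role of script-A_H from (15).
    top = int(np.ceil(np.log2(np.sqrt(p * A)))) - 1
    return sorted({2**k for k in range(top + 1)} | {p})

print(sparsity_grid(p=10_000, A=2.2))  # A chosen purely for illustration
```

Note the grid contains only logarithmically many elements below p𝒜\sqrt{p\mathscr{A}_{\mathcal{H}}}, which is the computational advantage mentioned above.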

Theorem 3.

There exist universal positive constants D,K2,K2,D,K_{2},K_{2}^{\prime}, and K3K_{3} such that the following holds. Let 𝒜\mathscr{A}_{\mathcal{H}}, 𝒱\mathscr{V}_{\mathcal{H}}, and 𝒮\mathscr{S} be given by (15), (16), and (22) respectively. For ν𝒱\nu\in\mathscr{V}_{\mathcal{H}} and s[p]s\in[p], set dν=νDd_{\nu}=\nu\vee\lceil D\rceil and set

rν,s\displaystyle r_{\nu,s} :=K2(dνlog(1+p𝒜s2))1/4,\displaystyle:=K_{2}\left(d_{\nu}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right)^{1/4},
rs\displaystyle r_{s}^{\prime} :=K2log(1+p𝒜s2).\displaystyle:=K_{2}^{\prime}\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}.

If η(0,1)\eta\in(0,1), then there exists Cη>0C_{\eta}>0 depending only on η\eta such that for all C>CηC>C_{\eta}, we have

P0{maxν𝒱maxs𝒮φν,s0}+max1spsupfs,f2Cτadapt(p,s,n)Pf{maxν𝒱maxs𝒮φν,s1}ηP_{0}\left\{\max_{\nu\in\mathscr{V}_{\mathcal{H}}}\max_{s\in\mathscr{S}}\varphi_{\nu,s}\neq 0\right\}+\max_{1\leq s^{*}\leq p}\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s^{*}},\\ ||f||_{2}\geq C\tau_{\text{adapt}}(p,s^{*},n)\end{subarray}}P_{f}\left\{\max_{\nu\in\mathscr{V}_{\mathcal{H}}}\max_{s\in\mathscr{S}}\varphi_{\nu,s}\neq 1\right\}\leq\eta

where τadapt\tau_{\text{adapt}} is given by (21). Here, φν,s\varphi_{\nu,s} is given by

φν,s:={𝟙{Trν,s(dν)>Csνlog(1+p𝒜s2)}if s<p𝒜 and log(1+p𝒜s2)K3dν,𝟙{Trs(dν)>Cslog(1+p𝒜s2)}if s<p𝒜 and log(1+p𝒜s2)>K3dν,𝟙{nj=1pkνXk,j2>νp+Cνp𝒜}if sp𝒜\varphi_{\nu,s}:=\begin{cases}\mathbbm{1}_{\left\{T_{r_{\nu,s}}(d_{\nu})>Cs\sqrt{\nu\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}\right\}}&\textit{if }s<\sqrt{p\mathscr{A}_{\mathcal{H}}}\textit{ and }\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}\leq K_{3}\sqrt{d_{\nu}},\\ \mathbbm{1}_{\left\{T_{r_{s}^{\prime}}(d_{\nu})>Cs\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right\}}&\textit{if }s<\sqrt{p\mathscr{A}_{\mathcal{H}}}\textit{ and }\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}>K_{3}\sqrt{d_{\nu}},\\ \mathbbm{1}_{\left\{n\sum_{j=1}^{p}\sum_{k\leq\nu}X_{k,j}^{2}>\nu p+C\sqrt{\nu p\mathscr{A}_{\mathcal{H}}}\right\}}&\textit{if }s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{cases}

and the statistic Tr(d)T_{r}(d) is given by (12).

As mentioned earlier, the adaptive test involves scanning over the potential values of ν\nu in the geometric grid 𝒱\mathscr{V}_{\mathcal{H}}. Consequently, a cost involving 𝒜\mathscr{A}_{\mathcal{H}} is paid. Note for s<p𝒜s<\sqrt{p\mathscr{A}_{\mathcal{H}}} we also need to scan over s𝒮s\in\mathscr{S} in order to set the thresholds in the statistics Tr(d)T_{r}(d) properly. It turns out this extra scan does not incur a cost; the cost is driven by the need to scan over ν\nu.

To understand how scanning over ν\nu can incur a cost, it is most intuitive to consider the need to control the Type I error when scanning over the χ2\chi^{2} statistics with various degrees of freedom. Roughly speaking, it is necessary to pick the threshold values tνt_{\nu} such that the Type I error P(ν𝒱{χνp2νp>tν})P\left(\bigcup_{\nu\in\mathscr{V}_{\mathcal{H}}}\left\{\chi^{2}_{\nu p}-\nu p>t_{\nu}\right\}\right) is smaller than some prescribed error level. By the union bound and (14), consider the tail bound P(ν𝒱{χνp2νp>2νpu+2u})ν𝒱P{χνp2νp>2νpu+2u}|𝒱|euP\left(\bigcup_{\nu\in\mathscr{V}_{\mathcal{H}}}\left\{\chi^{2}_{\nu p}-\nu p>2\sqrt{\nu pu}+2u\right\}\right)\leq\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}P\left\{\chi^{2}_{\nu p}-\nu p>2\sqrt{\nu pu}+2u\right\}\leq|\mathscr{V}_{\mathcal{H}}|e^{-u}. To ensure the Type I error is small, this tail bound suggests the choice tννplog|𝒱|+2log|𝒱|νplog|𝒱|t_{\nu}\asymp\sqrt{\nu p\log|\mathscr{V}_{\mathcal{H}}|}+2\log|\mathscr{V}_{\mathcal{H}}|\asymp\sqrt{\nu p\log|\mathscr{V}_{\mathcal{H}}|} where the latter equivalence follows from the fact that |𝒱||\mathscr{V}_{\mathcal{H}}| grows at most logarithmically in pp. Since the threshold tνt_{\nu} is inflated, a signal must also have larger norm in order to be detected. The logarithmic inflation factor is a consequence of the χ2\chi^{2} tail.
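The following Monte Carlo sketch illustrates this phenomenon with simplified thresholds of the form νp+multνplog|𝒱|\nu p+\text{mult}\cdot\sqrt{\nu p\log|\mathscr{V}_{\mathcal{H}}|}; the grid, the multiplier, and the dimensions are all hypothetical, and the exact thresholds in our test involve the additional lower-order terms displayed above.

```python
import numpy as np

rng = np.random.default_rng(0)
p, reps = 2_000, 2_000
V = [2**k for k in range(11)]  # stand-in for the geometric grid V_H
log_V = np.log(len(V))

def family_wise_type1(mult):
    # Reject if any chi^2_{nu*p} statistic exceeds nu*p + mult*sqrt(nu*p*log|V|).
    reject = np.zeros(reps, dtype=bool)
    for nu in V:
        T = rng.chisquare(nu * p, size=reps)
        reject |= T > nu * p + mult * np.sqrt(nu * p * log_V)
    return reject.mean()

# A small multiplier lets the union of chi^2 tails blow up; a larger one,
# matching the sqrt(log|V|) inflation above, controls the scan's Type I error.
print(family_wise_type1(1.0), family_wise_type1(4.0))
```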

3.4 Special cases

To illustrate how the price for adaptation depends delicately on the choice of function space \mathcal{H}, we consider a variety of cases. Throughout, we assume log(1+p𝒜)n/2\log(1+p\mathscr{A}_{\mathcal{H}})\leq n/2 as well as the other assumptions necessary for our results.

3.4.1 Sobolev

Taking μkk2α\mu_{k}\asymp k^{-2\alpha} as emblematic of Sobolev space with smoothness α\alpha, it can be shown that 𝒜log(e|𝒱~|)loglogp\mathscr{A}_{\mathcal{H}}\asymp\log(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|)\asymp\log\log p (recall (15) and (20)). Our upper and lower bounds assert an adaptive testing rate is given by

εadapt(p,s,n)2{slog(ploglogps2)n+s(nlog(ploglogps2))4α4α+1if s<ploglogp,s(nsploglogp)4α4α+1if sploglogp.\varepsilon_{\text{adapt}}(p,s,n)^{2}\asymp\begin{cases}\frac{s\log\left(\frac{p\log\log p}{s^{2}}\right)}{n}+s\left(\frac{n}{\sqrt{\log\left(\frac{p\log\log p}{s^{2}}\right)}}\right)^{-\frac{4\alpha}{4\alpha+1}}&\textit{if }s<\sqrt{p\log\log p},\\ s\left(\frac{ns}{\sqrt{p\log\log p}}\right)^{-\frac{4\alpha}{4\alpha+1}}&\textit{if }s\geq\sqrt{p\log\log p}.\end{cases}
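For concreteness, the following sketch evaluates this rate (up to constants) across its two branches; the smoothness α\alpha and all numeric inputs are illustrative assumptions.

```python
import numpy as np

def eps_adapt_sq(p, s, n, alpha=1.0):
    # The displayed Sobolev adaptive rate, up to constants; log(log(p))
    # plays the role of script-A_H in this case.
    A = np.log(np.log(p))
    if s < np.sqrt(p * A):
        L = np.log(p * A / s**2)
        return s * L / n + s * (n / np.sqrt(L)) ** (-4 * alpha / (4 * alpha + 1))
    return s * (n * s / np.sqrt(p * A)) ** (-4 * alpha / (4 * alpha + 1))

for s in (5, 50, 5_000):  # spans the sparse and dense branches
    print(s, eps_adapt_sq(p=100_000, s=s, n=2_000))
```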

Ingster and Lepski [25] consider adaptation in the sparse regime s=O(p1/2δ)s=O(p^{1/2-\delta}) and the dense regime s=p1/2+δs=p^{1/2+\delta} for δ(0,1/2)\delta\in(0,1/2) separately. In the sparse regime, they construct an adaptive test which is able to achieve the minimax rate; no cost is paid. In the dense regime, not only do they give a test which achieves s(nsploglogp)4α4α+1s\left(\frac{ns}{\sqrt{p\log\log p}}\right)^{-\frac{4\alpha}{4\alpha+1}}, they also supply a lower bound (now, in our lower bound notation, 𝒮=[p]\mathcal{S}=[p] can be taken) showing it cannot be improved. As seen in the above display, our test enjoys optimality in these regimes.

The reader should take care when comparing [25] to our result. In our setting, the unknown sparsity level can vary throughout the entire range s[p]s\in[p]. In contrast, [25] consider two separate cases. Either the unknown sparsity ss is constrained to sp1/2δs\leq p^{1/2-\delta}, or it is constrained to sp1/2+δs\geq p^{1/2+\delta}. To elaborate, the precise value of ss is unknown but [25] assumes the statistician has knowledge of which of the two cases is in force. In contrast, we make no such assumption; nothing at all is known about the sparsity level. In our view, this situation is more faithful to the problem facing the practitioner. Despite being constructed for a seemingly more difficult setting, it turns out our test is optimal under Ingster and Lepski’s setting.

Our result provides finer detail around p\sqrt{p} missed by [25]. In Theorem 3 of their article, Ingster and Lepski propose an adaptive test procedure which is applicable in the regime sps\asymp\sqrt{p}. It requires signal strength of squared order at least

p(nloglogp)4α4α+1.\sqrt{p}\left(\frac{n}{\sqrt{\log\log p}}\right)^{-\frac{4\alpha}{4\alpha+1}}.

This is suboptimal since our test achieves

{p(nlogloglogp)4α4α+1if logloglogpn12α+1,plogloglogpnif logloglogp>n12α+1,\begin{cases}\sqrt{p}\left(\frac{n}{\sqrt{\log\log\log p}}\right)^{-\frac{4\alpha}{4\alpha+1}}&\textit{if }\log\log\log p\leq n^{\frac{1}{2\alpha+1}},\\ \frac{\sqrt{p}\log\log\log p}{n}&\textit{if }\log\log\log p>n^{\frac{1}{2\alpha+1}},\end{cases}

which is faster.

3.4.2 Finite dimension

Consider a finite dimensional structure 1=μ1=μ2==μm>μm+1==01=\mu_{1}=\mu_{2}=...=\mu_{m}>\mu_{m+1}=...=0 where mm is a positive integer. If m<n2log(1+p)m<\frac{n^{2}}{\log\left(1+p\right)}, it can be shown 𝒜1\mathscr{A}_{\mathcal{H}}\asymp 1, and so the minimax separation rate can be achieved by our adaptive test. In other words, no cost is paid for adaptation.

3.4.3 Exponential decay

Consider exponential decay of the eigenvalues μk=c1ec2kγ\mu_{k}=c_{1}e^{-c_{2}k^{\gamma}} where γ>0\gamma>0. It can be shown that 𝒜log(e|𝒱~|)logloglogp\mathscr{A}_{\mathcal{H}}\asymp\log(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|)\asymp\log\log\log p, and so an adaptive rate is

εadapt(p,s,n)2{slog(plogloglogps2)n+slog12γ(nlog(plogloglogps2))log(plogloglogps2)nif s<plogloglogp,plogloglogpnlog12γ(nsplogloglogp)if splogloglogp.\varepsilon_{\text{adapt}}(p,s,n)^{2}\asymp\begin{cases}\frac{s\log\left(\frac{p\log\log\log p}{s^{2}}\right)}{n}+s\log^{\frac{1}{2\gamma}}\left(\frac{n}{\sqrt{\log\left(\frac{p\log\log\log p}{s^{2}}\right)}}\right)\cdot\frac{\sqrt{\log\left(\frac{p\log\log\log p}{s^{2}}\right)}}{n}&\textit{if }s<\sqrt{p\log\log\log p},\\ \frac{\sqrt{p\log\log\log p}}{n}\log^{\frac{1}{2\gamma}}\left(\frac{ns}{\sqrt{p\log\log\log p}}\right)&\textit{if }s\geq\sqrt{p\log\log\log p}.\end{cases}

The cost for adaptation here grows very slowly, which is perhaps another indication of the relatively “small” size of RKHSs based on Gaussian kernels (as noted in Section 2.5).

4 Adaptation to both sparsity and smoothness

So far we have assumed that the space \mathcal{H} is known. However, it is likely that \mathcal{H} is one constituent out of a collection of spaces indexed by some hyperparameter, such as a smoothness level. Typically, the true value of this hyperparameter is unknown in addition to the sparsity being unknown, and it is of interest to understand how the testing rate changes. To avoid excessive abstractness, we follow [25, 17] and adopt a working model where 𝒯s\mathscr{T}_{s} has μk\mu_{k} of the form μk=k2α\mu_{k}=k^{-2\alpha} emblematic of Sobolev space with smoothness α\alpha. To emphasize the dependence on α\alpha, let us write 𝒯(s,α)\mathscr{T}(s,\alpha) and its corresponding space (s,α)\mathcal{F}(s,\alpha). Reiterating, we are interested in adapting to both the sparsity level ss and the smoothness level α\alpha.

Ingster and Lepski [25] study adaptation to both sparsity and smoothness. In particular, they study adaptation for an unknown α[α0,α1]\alpha\in[\alpha_{0},\alpha_{1}] in a closed interval where the endpoints α0<α1\alpha_{0}<\alpha_{1} are known. As argued in [25], since α1\alpha_{1} is known, the sequence space basis in Section 1.2.2 for any α[α0,α1]\alpha\in[\alpha_{0},\alpha_{1}] can be taken to be α1\alpha_{1}-regular, and thus is known to the statistician. Furthermore, they separately study the dense regime sp1/2+δs\geq p^{1/2+\delta} and the sparse regime s<p1/2δs<p^{1/2-\delta} for some constant δ(0,1/2)\delta\in(0,1/2). We adopt their setup.

Ingster and Lepski [25] claim the following adaptive rate. In the dense case sp1/2+δs\geq p^{1/2+\delta}, they make the assumption ps2log(n2s2/plog(n2s2/p))0\frac{p}{s^{2}}\log\left(\frac{n^{2}s^{2}/p}{\log(n^{2}s^{2}/p)}\right)\to 0 and state (Theorem 4 in [25])

s(nsplog(nsp))4α4α+1.s\left(\frac{ns}{\sqrt{p\log\left(\frac{ns}{\sqrt{p}}\right)}}\right)^{-\frac{4\alpha}{4\alpha+1}}.

In contrast, we will show the adaptive rate in the dense case sp1/2+δs\geq p^{1/2+\delta} is

εdense2(p,s,n,α)s(nsploglog(np))4α4α+1.\varepsilon_{\text{dense}}^{2}(p,s,n,\alpha)\asymp s\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{4\alpha}{4\alpha+1}}.

(The same error affects the proofs of the lower bounds in Theorems 4 and 5 in [25]. The prior they define is not supported on the parameter space: under their prior (see (75) in [25]), the various coordinate functions fjf_{j} end up having different smoothness levels αj\alpha_{j} instead of sharing a single α\alpha.)

The careful reader will note that in the special case p=1p=1, our answer recovers the adaptive separation rate (n/loglogn)4α4α+1\left(n/\sqrt{\log\log n}\right)^{-\frac{4\alpha}{4\alpha+1}} proved by Spokoiny [41]. In the sparse case s<p1/2δs<p^{1/2-\delta}, their claimed rate (Theorem 5 in [25]) is

slogpn+s(nlog(plogn))4α4α+1.\frac{s\log p}{n}+s\left(\frac{n}{\sqrt{\log(p\log n)}}\right)^{-\frac{4\alpha}{4\alpha+1}}. (23)

As a partial correction, we prove the following lower bound

slog(ploglogn)n+s(nlog(ploglogn))4α4α+1.\frac{s\log(p\log\log n)}{n}+s\left(\frac{n}{\sqrt{\log(p\log\log n)}}\right)^{-\frac{4\alpha}{4\alpha+1}}. (24)

In the regime logploglogn\log p\gtrsim\log\log n, (23) and (24) match each other and match the minimax rate (see Section 2.5.1). However, in the case logploglogn\log p\lesssim\log\log n, there may be a difference between (23) and (24).

4.1 Dense

Fix δ(0,1/2)\delta\in(0,1/2) and α0<α1\alpha_{0}<\alpha_{1}. Recall in the dense regime that sp1/2+δs\geq p^{1/2+\delta}. Define

τdense2(p,s,n,α):=s(nsploglog(np))4α4α+1.\tau_{\text{dense}}^{2}(p,s,n,\alpha):=s\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{4\alpha}{4\alpha+1}}. (25)

As mentioned earlier, we will be establishing this rate in the dense regime. For use in a testing procedure, define the geometric grid

𝒱test:={1,2,4,8,2log2((npploglog(np))24α0+1)}.\mathcal{V}_{\text{test}}:=\left\{1,2,4,8,...2^{\left\lceil\log_{2}\left(\left(\frac{np}{\sqrt{p\log\log(np)}}\right)^{\frac{2}{4\alpha_{0}+1}}\right)\right\rceil}\right\}. (26)

Note the statistician needs to know neither ss nor α\alpha to construct 𝒱test\mathcal{V}_{\text{test}}. Further note log|𝒱test|loglog(np)\log|\mathcal{V}_{\text{test}}|\asymp\log\log(np).

4.1.1 Upper bound

The test procedure employed follows the typical strategy in adaptive testing, namely constructing individual tests for the potential values of ν\nu in the grid 𝒱test\mathcal{V}_{\text{test}}, and then taking the maximum over the individual tests to perform signal detection. The cost of adaptation involves the logarithm of the cardinality of 𝒱test\mathcal{V}_{\text{test}} (i.e. loglog(np)\log\log(np)). Since the dense regime is in force, the individual tests are χ2\chi^{2} tests.

Theorem 4.

Fix δ(0,1/2)\delta\in(0,1/2) and α0<α1\alpha_{0}<\alpha_{1}. If η(0,1)\eta\in(0,1), then there exists Cη>0C_{\eta}>0 depending only on η\eta such that for all C>CηC>C_{\eta}, we have

P0{maxν𝒱testφν=1}+maxsp1/2+δsupα[α0,α1]supf(s,α),f2Cτdense(p,s,n,α)Pf{maxν𝒱testφν=0}ηP_{0}\left\{\max_{\nu\in\mathcal{V}_{\text{test}}}\varphi_{\nu}=1\right\}+\max_{s\geq p^{1/2+\delta}}\sup_{\alpha\in[\alpha_{0},\alpha_{1}]}\sup_{\begin{subarray}{c}f\in\mathcal{F}(s,\alpha),\\ ||f||_{2}\geq C\tau_{\text{dense}}(p,s,n,\alpha)\end{subarray}}P_{f}\left\{\max_{\nu\in\mathcal{V}_{\text{test}}}\varphi_{\nu}=0\right\}\leq\eta

where

φν=𝟙{nj=1pkνXk,j2νp+Kη(νploglog(np)+loglog(np))}.\varphi_{\nu}=\mathbbm{1}_{\left\{n\sum_{j=1}^{p}\sum_{k\leq\nu}X_{k,j}^{2}\geq\nu p+K_{\eta}\left(\sqrt{\nu p\log\log(np)}+\log\log(np)\right)\right\}}.

Here, KηK_{\eta} is a constant depending only on η\eta, τdense\tau_{\text{dense}} is given by (25), and 𝒱test\mathcal{V}_{\text{test}} is given by (26).
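The following simulation sketch illustrates the structure of the test in sequence space: build the grid (26), form each χ2\chi^{2} statistic, and reject if any exceeds its inflated threshold. The constant KηK_{\eta}, the smoothness floor α0\alpha_{0}, and the dimensions are hypothetical choices for illustration rather than the calibrated values from the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 400
K_eta, alpha0 = 3.0, 0.5  # hypothetical constant and smoothness floor
lll = np.log(np.log(n * p))
top = int(np.ceil(np.log2((n * p / np.sqrt(p * lll)) ** (2 / (4 * alpha0 + 1)))))
V_test = [2**k for k in range(top + 1)]  # the grid (26)

def adaptive_dense_test(theta):
    # theta: (k_max, p) array of basis coefficients; X_{k,j} = theta_{k,j} + N(0, 1/n).
    X = theta + rng.standard_normal(theta.shape) / np.sqrt(n)
    for nu in V_test:
        T = n * np.sum(X[:nu, :] ** 2)  # chi^2_{nu*p} under the null
        if T >= nu * p + K_eta * (np.sqrt(nu * p * lll) + lll):
            return 1
    return 0

null = np.zeros((max(V_test), p))
print(np.mean([adaptive_dense_test(null) for _ in range(200)]))  # Type I error
```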

4.1.2 Lower bound

The following theorem states the lower bound. Note that it satisfies Definition 6 with the straightforward modification to incorporate adaptation over α\alpha. In particular, the potential difficulty with absurdities outlined in Section 3.2 does not arise since the candidate rate

τdense2(p,s,n,α)=s(nsploglog(np))4α4α+1\tau^{2}_{\text{dense}}(p,s,n,\alpha)=s\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{4\alpha}{4\alpha+1}}

is never of the same order as the minimax rate ε(p,s,n,α)2=s(nsp)4α4α+1\varepsilon^{*}(p,s,n,\alpha)^{2}=s\left(\frac{ns}{\sqrt{p}}\right)^{-\frac{4\alpha}{4\alpha+1}}.

Theorem 5.

Fix δ(0,1/2)\delta\in(0,1/2) and α0<α1\alpha_{0}<\alpha_{1}. If η(0,1)\eta\in(0,1), then there exists cη>0c_{\eta}>0 depending only on η\eta such that for all 0<c<cη0<c<c_{\eta}, we have

infφ{P0{φ0}+maxsp1/2+δsupα[α0,α1]supf(s,α),f2cτdense(p,s,n,α)Pf{φ1}}1η\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\max_{s\geq p^{1/2+\delta}}\sup_{\alpha\in[\alpha_{0},\alpha_{1}]}\sup_{\begin{subarray}{c}f\in\mathcal{F}(s,\alpha),\\ ||f||_{2}\geq c\tau_{\text{dense}}(p,s,n,\alpha)\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\geq 1-\eta

where τdense\tau_{\text{dense}} is given by (25).

4.2 Sparse

Fix δ(0,1/2)\delta\in(0,1/2) and α0<α1\alpha_{0}<\alpha_{1}. Recall that in the sparse regime we have s<p1/2δs<p^{1/2-\delta}. Define

τsparse2(p,s,n,α):=slog(ploglogn)n+s(nlog(ploglogn))4α4α+1.\tau_{\text{sparse}}^{2}(p,s,n,\alpha):=\frac{s\log\left(p\log\log n\right)}{n}+s\left(\frac{n}{\sqrt{\log(p\log\log n)}}\right)^{-\frac{4\alpha}{4\alpha+1}}. (27)

Ingster and Lepski [25] give a testing procedure achieving (23) and thus establish the upper bound. As noted earlier, when logploglogn\log p\gtrsim\log\log n, not only do (27) and (23) match each other but they also match the minimax rate (see Section 2.5.1). In the regime logploglogn\log p\lesssim\log\log n, it is not clear from a lower bound perspective whether a cost for adaptation is unavoidable. We complement their upper bound by providing a lower bound which demonstrates that it is indeed necessary to pay a cost for adaptation. However, the cost we identify may not be sharp. To elaborate, in the case where pp is a large universal constant but much smaller than nn, the sparse regime rate (27) involves logloglogn\sqrt{\log\log\log n}. From Spokoiny’s article [41], loglogn\sqrt{\log\log n} is instead expected. We leave it for future work to pin down the sharp cost.

Theorem 6.

Fix δ(0,1/2)\delta\in(0,1/2) and α0<α1\alpha_{0}<\alpha_{1}. If η(0,1)\eta\in(0,1), then there exists cη>0c_{\eta}>0 depending only on η\eta such that for all 0<c<cη0<c<c_{\eta}, we have

infφ{P0{φ0}+maxs<p1/2δsupα[α0,α1]supf(s,α),f2cτsparse(p,s,n,α)Pf{φ1}}1η\displaystyle\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\max_{s<p^{1/2-\delta}}\sup_{\alpha\in[\alpha_{0},\alpha_{1}]}\sup_{\begin{subarray}{c}f\in\mathcal{F}(s,\alpha),\\ ||f||_{2}\geq c\tau_{\text{sparse}}(p,s,n,\alpha)\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\geq 1-\eta

where τsparse\tau_{\text{sparse}} is given by (27).

As discussed with respect to Theorem 5, the lower bound in Theorem 6 satisfies Definition 6 with the straightforward modification to incorporate adaptation over α\alpha.

5 Discussion

5.1 Relaxing the centered assumption

In Section 1.2.2, in the definition of the parameter space (4) the signal ff was constrained to be centered. It was claimed this constraint is mild and can be relaxed; this section elaborates on this point.

Suppose ff is an additive function (possibly uncentered) with each fjf_{j}\in\mathcal{H}. Let f¯=[0,1]pf(x)𝑑x\bar{f}=\int_{[0,1]^{p}}f(x)\,dx and note we can write f=f¯𝟏p+(ff¯𝟏p)f=\bar{f}\mathbf{1}_{p}+(f-\bar{f}\mathbf{1}_{p}) where 𝟏pL2([0,1]p)\mathbf{1}_{p}\in L^{2}([0,1]^{p}) is the constant function equal to one. Since ff is an additive function, for any x[0,1]px\in[0,1]^{p} we have f(x)=f¯+g(x)f(x)=\bar{f}+g(x) where g(x)=j=1pgj(xj)g(x)=\sum_{j=1}^{p}g_{j}(x_{j}) with gj(xj)=fj(xj)01fj(t)𝑑tg_{j}(x_{j})=f_{j}(x_{j})-\int_{0}^{1}f_{j}(t)\,dt. In other words, we have f=f¯𝟏p+gf=\bar{f}\mathbf{1}_{p}+g where gg itself is an additive function with each gj0g_{j}\in\mathcal{H}_{0}.

Asserting ff is a sparse additive function implies gg is a (centered) sparse additive function. Following [38, 17], we impose constraints (i.e. RKHS norm constraints) on the centered component functions gjg_{j}. Therefore, the following uncentered signal detection problem can be considered,

H0\displaystyle H_{0} :f0,\displaystyle:f\equiv 0, (28)
H1\displaystyle H_{1} :f2εf is s-sparse additive, and gs\displaystyle:||f||_{2}\geq\varepsilon\text{, }f\text{ is }s\text{-sparse additive, and }g\in\mathcal{F}_{s} (29)

where s\mathcal{F}_{s} is given by (4). Let ε\varepsilon^{*} denote the minimax separation rate. Consider f22=f¯2+g22||f||_{2}^{2}=\bar{f}^{2}+||g||_{2}^{2} by orthogonality. Thus, there are two related detection problems, Problem I

H0\displaystyle H_{0} :f0,\displaystyle:f\equiv 0,
H1\displaystyle H_{1} :|f¯|ε1 and f is s-sparse additive\displaystyle:|\bar{f}|\geq\varepsilon_{1}\text{ and }f\text{ is }s\text{-sparse additive}

and Problem II

H0\displaystyle H_{0} :f0,\displaystyle:f\equiv 0,
H1\displaystyle H_{1} :g2ε2 and gs.\displaystyle:||g||_{2}\geq\varepsilon_{2}\text{ and }g\in\mathcal{F}_{s}.

Let ε1\varepsilon_{1}^{*} and ε2\varepsilon_{2}^{*} denote the minimax separation rates of the respective problems. We claim (ε)2(ε1)2+(ε2)2(\varepsilon^{*})^{2}\asymp(\varepsilon_{1}^{*})^{2}+(\varepsilon_{2}^{*})^{2}. From f22=f¯2+g22f¯2g22||f||_{2}^{2}=\bar{f}^{2}+||g||_{2}^{2}\geq\bar{f}^{2}\vee||g||_{2}^{2} it is directly seen that (ε)2(ε1)2(ε2)2(ε1)2+(ε2)2(\varepsilon^{*})^{2}\gtrsim(\varepsilon_{1}^{*})^{2}\vee(\varepsilon_{2}^{*})^{2}\asymp(\varepsilon_{1}^{*})^{2}+(\varepsilon_{2}^{*})^{2}, and so the lower bound is proved. To show the upper bound, consider that if f22C((ε1)2+(ε2)2)||f||_{2}^{2}\geq C\left((\varepsilon_{1}^{*})^{2}+(\varepsilon_{2}^{*})^{2}\right), then it follows (for otherwise the two terms would sum to strictly less than C((ε1)2+(ε2)2)C((\varepsilon_{1}^{*})^{2}+(\varepsilon_{2}^{*})^{2})) that either |f¯|2C(ε1)2|\bar{f}|^{2}\geq C(\varepsilon_{1}^{*})^{2} or g22C(ε2)2||g||_{2}^{2}\geq C(\varepsilon_{2}^{*})^{2}. Thus, there is enough signal to detect and the test which takes the maximum over the subproblem minimax tests is optimal. Hence, the upper bound (ε)2(ε1)2+(ε2)2(\varepsilon^{*})^{2}\lesssim(\varepsilon_{1}^{*})^{2}+(\varepsilon_{2}^{*})^{2} is now also proved.

All that remains is to determine the minimax rates ε1\varepsilon_{1}^{*} and ε2\varepsilon_{2}^{*}. Let us index starting from zero and take {ψk}k=0\{\psi_{k}\}_{k=0}^{\infty} and {μk}k=0\{\mu_{k}\}_{k=0}^{\infty} to be the associated eigenfunctions and eigenvalues of the RKHS 0\mathcal{H}_{0}. Recall {ψk}k=0\{\psi_{k}\}_{k=0}^{\infty} forms an orthonormal basis for L2([0,1])L^{2}([0,1]) under the usual L2L^{2} inner product. Since by definition 0\mathcal{H}_{0} is orthogonal to the span of the constant function in L2([0,1])L^{2}([0,1]), without loss of generality we can take ψ0\psi_{0} to be the constant function equal to one, μ0=0\mu_{0}=0, and {ψk}k=1\{\psi_{k}\}_{k=1}^{\infty} to be the remaining eigenfunctions orthogonal (in L2L^{2}) to constant functions.

The data {Xk,j}k,j[p]\{X_{k,j}\}_{k\in\mathbb{N},j\in[p]} (defined in (5)) is sufficient for Problem II. Therefore, the minimax rate ε2\varepsilon_{2}^{*} is exactly given by our result established in this paper. The minimax rate ε1\varepsilon_{1}^{*} can be upper bounded in the following manner. Consider that 𝟏p,dYxL2([0,1]p)N(f¯,1n)\langle\mathbf{1}_{p},dY_{x}\rangle_{L^{2}([0,1]^{p})}\sim N(\bar{f},\frac{1}{n}) since 𝟏p22=1||\mathbf{1}_{p}||_{2}^{2}=1. Therefore, the parametric rate ε11n\varepsilon_{1}^{*}\lesssim\frac{1}{\sqrt{n}} can be achieved. Hence, (ε)2(ε1)2+(ε2)2(ε2)2(\varepsilon^{*})^{2}\asymp(\varepsilon_{1}^{*})^{2}+(\varepsilon_{2}^{*})^{2}\asymp(\varepsilon_{2}^{*})^{2}. In other words, centering is immaterial to the fundamental limits of the signal detection problem.
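As a minimal illustration of why Problem I is only a parametric-rate problem, the following sketch tests the single Gaussian observation described above; the threshold constant is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n, C = 1_000, 4.0  # illustrative sample size and threshold constant

def mean_test(fbar, reps=5_000):
    # Problem I reduces to one observation Z ~ N(fbar, 1/n); reject when
    # |Z| > C/sqrt(n), which detects |fbar| of order 1/sqrt(n).
    Z = fbar + rng.standard_normal(reps) / np.sqrt(n)
    return np.mean(np.abs(Z) > C / np.sqrt(n))

print(mean_test(0.0))               # Type I error, essentially zero
print(mean_test(8.0 / np.sqrt(n)))  # power at a signal of parametric order
```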

5.2 Sharp constant

As we noted earlier, our results do not say anything about the sharp constant in the detection boundary. The problem of obtaining a sharp characterization of the constants in the detection boundary is interesting and likely delicate. In an asymptotic setup and in the Sobolev case, Gayraud and Ingster [17] were able to derive the sharp constant in the sparse regime s=p1βs=p^{1-\beta} for fixed β(1/2,1)\beta\in(1/2,1) under the condition logp=o(n12α+1)\log p=o(n^{\frac{1}{2\alpha+1}}). Gayraud and Ingster discuss that this condition is actually essential, in that the detection boundary no longer exhibits a dependence on β\beta when the condition is violated. This condition has a nice formulation in our notation, namely it is equivalent to the condition

log(p/s2)n=o(Γ).\frac{\log(p/s^{2})}{n}=o\left(\Gamma_{\mathcal{H}}\right).

This correspondence in the Sobolev case suggests this condition may actually be essential for a generic \mathcal{H}. It would be interesting to understand whether that is true and, if so, to derive the sharp constant when the condition holds.

To be clear, it may be the case (perhaps likely) that our proposed procedure is suboptimal in terms of the constant. Indeed, the existing literature on sparse signal detection, both in the sparse sequence model [14, 18] and in sparse linear regression [26, 1], relies on Higher Criticism type tests to achieve the optimal constant. Gayraud and Ingster [17] themselves use Higher Criticism. For a generic space \mathcal{H}, our procedure should not be understood as the only test which is rate-optimal. In the sparsity regime s=p1/2δs=p^{1/2-\delta}, we suspect an analogous Higher Criticism type statistic which accounts for the eigenstructure of the kernel might not only achieve the optimal rate, but also the sharp constant.

5.3 Future directions

There are a number of avenues for future work. First, we only considered one space \mathcal{H} in which all component functions fjf_{j} live. In some scenarios, it may be desirable to consider a different space j\mathcal{H}_{j} for each component. Raskutti et al. [38] obtained the minimax estimation rate when considering multiple kernels (under some conditions on the kernels). We imagine our broad approach in this work could be extended to determine the minimax separation rate allowing multiple kernels. Instead of a common ν\nu_{\mathcal{H}}, it is likely different quantities νj\nu_{\mathcal{H}_{j}} will be needed per coordinate, and the test statistics could be modified in the natural way. The theory developed here could be used directly, and it seems plausible the minimax separation rate could be established in a straightforward manner.

Another avenue of research involves considering “soft” sparsity in the form of q\ell_{q} constraints for 0<q<10<q<1. Yuan and Zhou [50] developed minimax estimation rates for RKHSs exhibiting polynomial decay of their eigenvalues (e.g. Sobolev space). In terms of signal detection, it is plausible that the quadratic functional estimator under the Gaussian sequence model with an q\ell_{q} bound on the mean could be extended and used as a test statistic [10]. The hard sparsity and soft sparsity settings studied in [10] are handled quite similarly. It is possible not much additional theory needs to be developed in order to obtain minimax separation rates under soft sparsity.

Since hypothesis testing is quite closely related to functional estimation in many problems, it is natural to ask about functional estimation in the context of sparse additive models. For example, it would be of interest to estimate the quadratic functional f22||f||_{2}^{2} or the norm f2||f||_{2}. It is well known in the nonparametric literature that estimating LrL_{r} norms for odd rr yields drastically different rates from testing and from even rr [32, 19]. A compelling direction is to investigate the same problem in the sparse additive model setting.

Additionally, it would be interesting to consider the nonparametric regression model Yi=f(Xi)+ZiY_{i}=f(X_{i})+Z_{i} where the design distribution XiiidPXX_{i}\overset{iid}{\sim}P_{X} exhibits some dependence between the coordinates. The correspondence between the white noise model (1) and the nonparametric regression model we relied on requires that the design distribution be close to uniform on [0,1]p[0,1]^{p}. However, in practical situations it is typically the case that the coordinates of XiX_{i} exhibit dependence, and it would be interesting to understand how the fundamental limits of testing are affected.

Finally, in a somewhat different direction than that discussed so far, it is of interest to study the signal detection problem under group sparsity in other models. As we encouraged earlier, our results can be interpreted exclusively in terms of sequence space, that is, the problem (7)-(8) with parameter space (6). From this perspective, the group sparse structure is immediately apparent. Estimation has been extensively studied in group sparse settings, especially in linear regression (see [47] and references therein). Hypothesis testing has not witnessed nearly the same level of research activity, and so there is ample opportunity. We imagine some features of the rates established in this paper are general features of the group sparse structure, and it would be intriguing to discover the commonalities.

6 Acknowledgments

SK is grateful to Oleg Lepski for a kind correspondence. The research of SK is supported in part by NSF Grants DMS-1547396 and ECCS-2216912. The research of CG is supported in part by NSF Grant ECCS-2216912, NSF Career Award DMS-1847590, and an Alfred Sloan fellowship.

7 Proofs

7.1 Minimax upper bounds

7.1.1 Sparse

Proof of Proposition 2.

Fix η(0,1)\eta\in(0,1). We will set CηC_{\eta} at the end of the proof, so for now let C>CηC>C_{\eta}. The universal constant K1K_{1} will be selected in the course of the proof. Let LL^{*} denote the universal constant from Lemma 21. Set K2:=11(log2)1/4c1/4(Lc)1/4K_{2}:=1\vee\frac{1}{(\log 2)^{1/4}}\vee c^{-1/4}\vee\left(\frac{L^{*}}{c}\right)^{1/4} and K3:=LK22K_{3}:=\frac{L^{*}}{K_{2}^{2}}, where c=ccc=c^{*}\wedge c^{**}, with cc^{*} and cc^{**} being the universal constants in the exponential terms of Lemmas 15 and 14 respectively.

We first bound the Type I error. Since log(1+ps2)K3d\sqrt{\log\left(1+\frac{p}{s^{2}}\right)}\leq K_{3}\sqrt{d}, we have 1K22log(1+ps2)K22K3dLd1\leq K_{2}^{2}\sqrt{\log\left(1+\frac{p}{s^{2}}\right)}\leq K_{2}^{2}K_{3}\sqrt{d}\leq L^{*}\sqrt{d}. Therefore, we can apply Lemma 15 to obtain that

P0{Tr(d)>C(xpr4ecr4d+dr2x)}exP_{0}\left\{T_{r}(d)>C^{*}\left(\sqrt{xpr^{4}e^{-\frac{c^{*}r^{4}}{d}}}+\frac{d}{r^{2}}x\right)\right\}\leq e^{-x}

for any x>0x>0. Here, CC^{*} is a universal constant. Taking x=Cx=C and noting that C>1C>1 provided we select Cη1C_{\eta}\geq 1, we see that

C(xpr4ecr4d+dr2x)\displaystyle C^{*}\left(\sqrt{xpr^{4}e^{-\frac{c^{*}r^{4}}{d}}}+\frac{d}{r^{2}}x\right) CC(K22pdlog(1+ps2)ecK24log(1+ps2)+dr2)\displaystyle\leq C^{*}C\left(K_{2}^{2}\sqrt{pd\log\left(1+\frac{p}{s^{2}}\right)e^{-c^{*}K_{2}^{4}\log\left(1+\frac{p}{s^{2}}\right)}}+\frac{d}{r^{2}}\right)
CC(K22dlog(1+ps2)ps2s2+p+dr2)\displaystyle\leq C^{*}C\left(K_{2}^{2}\sqrt{d\log\left(1+\frac{p}{s^{2}}\right)}\sqrt{p\cdot\frac{s^{2}}{s^{2}+p}}+\frac{d}{r^{2}}\right)
CC(K22sdlog(1+ps2)+dr2)\displaystyle\leq C^{*}C\left(K_{2}^{2}s\sqrt{d\log\left(1+\frac{p}{s^{2}}\right)}+\frac{d}{r^{2}}\right)
2CK22Csdlog(1+ps2)\displaystyle\leq 2C^{*}K_{2}^{2}Cs\sqrt{d\log\left(1+\frac{p}{s^{2}}\right)}
=CK1nτ(p,s,n)2\displaystyle=CK_{1}n\tau(p,s,n)^{2}

where we have used that cK241c^{*}K_{2}^{4}\geq 1 and where we have selected K1=2CK22K_{1}=2C^{*}K_{2}^{2}. Note we have also used that dr2dK22sdlog(1+ps2)\frac{d}{r^{2}}\leq\sqrt{d}\leq K_{2}^{2}s\sqrt{d\log\left(1+\frac{p}{s^{2}}\right)}. Thus, with these choices of K1,K2,K_{1},K_{2}, and xx, we have

P0{Tr(d)>CK1nτ(p,s,n)2}ex=eCeCηη2P_{0}\left\{T_{r}(d)>CK_{1}n\tau(p,s,n)^{2}\right\}\leq e^{-x}=e^{-C}\leq e^{-C_{\eta}}\leq\frac{\eta}{2}

provided we select Cηlog(2η)1C_{\eta}\geq\log\left(\frac{2}{\eta}\right)\vee 1.

We now examine the Type II error. To bound the Type II error, we will use Chebyshev’s inequality. In particular, consider that for any fsf\in\mathcal{F}_{s} with f2Cτ(p,s,n)||f||_{2}\geq C\tau(p,s,n), we have

Pf{Tr(d)CK1nτ(p,s,n)2}\displaystyle P_{f}\left\{T_{r}(d)\leq CK_{1}n\tau(p,s,n)^{2}\right\} =Pf{Ef(Tr(d))CK1nτ(p,s,n)2Ef(Tr(d))Tr(d)}\displaystyle=P_{f}\left\{E_{f}(T_{r}(d))-CK_{1}n\tau(p,s,n)^{2}\leq E_{f}(T_{r}(d))-T_{r}(d)\right\}
Varf(Tr(d))(Ef(Tr(d))CK1nτ(p,s,n)2)2\displaystyle\leq\frac{\operatorname{Var}_{f}\left(T_{r}(d)\right)}{\left(E_{f}(T_{r}(d))-CK_{1}n\tau(p,s,n)^{2}\right)^{2}} (30)

provided that Ef(Tr(d))CK1nτ(p,s,n)2E_{f}(T_{r}(d))\geq CK_{1}n\tau(p,s,n)^{2}. To ensure this application of Chebyshev’s inequality is valid, we must compute suitable bounds for the expectation and variance of Tr(d)T_{r}(d), which the following lemmas provide.

Lemma 4.

If CηC_{\eta} is larger than some sufficiently large universal constant, then Ef(Tr(d))2CK1nτ(p,s,n)2E_{f}(T_{r}(d))\geq 2CK_{1}n\tau(p,s,n)^{2}.

Lemma 5.

If CηC_{\eta} is larger than some sufficiently large universal constant, then Varf(Tr(d))C(s2r4+Ef(Tr(d)))\operatorname{Var}_{f}(T_{r}(d))\leq C^{\dagger}\left(s^{2}r^{4}+E_{f}(T_{r}(d))\right) where C>0C^{\dagger}>0 is a universal constant.

These bounds are proved later on. Let us now describe why they allow us to bound the Type II error. From (30) as well as Lemmas 4 and 5, we have

Pf{Tr(d)CK1nτ(p,s,n)2}\displaystyle P_{f}\left\{T_{r}(d)\leq CK_{1}n\tau(p,s,n)^{2}\right\} Varf(Tr(d))(Ef(Tr(d))CK1nτ(p,s,n)2)2\displaystyle\leq\frac{\operatorname{Var}_{f}(T_{r}(d))}{\left(E_{f}(T_{r}(d))-CK_{1}n\tau(p,s,n)^{2}\right)^{2}}
Cs2r4+CEf(Tr(d))(Ef(Tr(d))CK1nτ(p,s,n)2)2\displaystyle\leq\frac{C^{\dagger}s^{2}r^{4}+C^{\dagger}E_{f}(T_{r}(d))}{\left(E_{f}(T_{r}(d))-CK_{1}n\tau(p,s,n)^{2}\right)^{2}}
CDK24s2νlog(1+ps2)C2K12n2τ(p,s,n)4+CEf(Tr(d))(Ef(Tr(d))CK1nτ(p,s,n)2)2\displaystyle\leq\frac{C^{\dagger}\left\lceil D\right\rceil K_{2}^{4}s^{2}\nu_{\mathcal{H}}\log\left(1+\frac{p}{s^{2}}\right)}{C^{2}K_{1}^{2}n^{2}\tau(p,s,n)^{4}}+\frac{C^{\dagger}E_{f}(T_{r}(d))}{\left(E_{f}(T_{r}(d))-CK_{1}n\tau(p,s,n)^{2}\right)^{2}}
=CDK24C2K12+CEf(Tr(d))(Ef(Tr(d))CK1nτ(p,s,n)2)2\displaystyle=\frac{C^{\dagger}\left\lceil D\right\rceil K_{2}^{4}}{C^{2}K_{1}^{2}}+\frac{C^{\dagger}E_{f}(T_{r}(d))}{\left(E_{f}(T_{r}(d))-CK_{1}n\tau(p,s,n)^{2}\right)^{2}}
CDK24C2K12+CEf(Tr(d))14(Ef(Tr(d)))2\displaystyle\leq\frac{C^{\dagger}\left\lceil D\right\rceil K_{2}^{4}}{C^{2}K_{1}^{2}}+\frac{C^{\dagger}E_{f}(T_{r}(d))}{\frac{1}{4}\left(E_{f}(T_{r}(d))\right)^{2}}
CDK24C2K12+4CEf(Tr(d))\displaystyle\leq\frac{C^{\dagger}\left\lceil D\right\rceil K_{2}^{4}}{C^{2}K_{1}^{2}}+\frac{4C^{\dagger}}{E_{f}(T_{r}(d))}
CDK24C2K12+4C2CK1nτ(p,s,n)2\displaystyle\leq\frac{C^{\dagger}\left\lceil D\right\rceil K_{2}^{4}}{C^{2}K_{1}^{2}}+\frac{4C^{\dagger}}{2CK_{1}n\tau(p,s,n)^{2}}
CDK24C2K12+2CCK1log2\displaystyle\leq\frac{C^{\dagger}\left\lceil D\right\rceil K_{2}^{4}}{C^{2}K_{1}^{2}}+\frac{2C^{\dagger}}{CK_{1}\sqrt{\log 2}}

provided we pick CηC_{\eta} larger than a sufficiently large universal constant as required by Lemmas 4 and 5. With this bound in hand and since CCηC\geq C_{\eta}, we can now pick CηC_{\eta} sufficiently large depending only on η\eta to obtain

supfs,f2Cτ(p,s,n)Pf{Tr(d)CK1nτ(p,s,n)2}η2.\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}P_{f}\left\{T_{r}(d)\leq CK_{1}n\tau(p,s,n)^{2}\right\}\leq\frac{\eta}{2}.

To summarize, we can pick CηC_{\eta} depending only on η\eta to be sufficiently large and also satisfying the condition Cηlog(2η)1C_{\eta}\geq\log\left(\frac{2}{\eta}\right)\vee 1 and those of Lemmas 4, 5 to ensure the testing risk is bounded by η\eta, i.e.

P0{Tr(d)>CK1nτ(p,s,n)2}+supfs,f2Cτ(p,s,n)Pf{Tr(d)CK1nτ(p,s,n)2}η,P_{0}\left\{T_{r}(d)>CK_{1}n\tau(p,s,n)^{2}\right\}+\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}P_{f}\left\{T_{r}(d)\leq CK_{1}n\tau(p,s,n)^{2}\right\}\leq\eta,

as desired. ∎

It remains to prove Lemmas 4 and 5. Recall we work in the environment of the proof of Proposition 2.

Proof of Lemma 4.

To prove the lower bound on the expectation, first recall

Ef(Tr(d))=j=1pEf((Ej(d)αr(d))𝟙{Ej(d)d+r2}).E_{f}(T_{r}(d))=\sum_{j=1}^{p}E_{f}\left(\left(E_{j}(d)-\alpha_{r}(d)\right)\mathbbm{1}_{\{E_{j}(d)\geq d+r^{2}\}}\right).

Under PfP_{f}, we have Ej(d)χd2(mj2)E_{j}(d)\sim\chi^{2}_{d}(m_{j}^{2}) where mj2=nkdθk,j2m_{j}^{2}=n\sum_{k\leq d}\theta_{k,j}^{2}. Here, the collection {θk,j}\{\theta_{k,j}\} denotes the basis coefficients of ff. Intuitively, there are two reasons why the expectation might be small. First, we are thresholding, thereby removing those coordinates with small, but nonetheless nonzero, means. Second, we are truncating at level dd and not considering higher-order basis coefficients; this also incurs a loss in signal.
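As a rough numerical aside (not part of the proof), the following sketch simulates a single coordinate's contribution to Tr(d)T_{r}(d). Since the exact centering αr(d)\alpha_{r}(d) belongs to the definition (12) and is not reproduced here, the sketch approximates it by a Monte Carlo null conditional mean; the values of dd and r2r^{2} are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r2, reps = 20, 30.0, 200_000  # arbitrary illustrative choices

# Stand-in centering: the null conditional mean of E_j(d) on the threshold
# event, used only so that the sketch is self-contained.
E0 = rng.chisquare(d, size=reps)
alpha = E0[E0 >= d + r2].mean()

def summand_mean(m2):
    # One coordinate's contribution: E_j(d) ~ chi^2_d(m_j^2), thresholded at d + r^2.
    E = rng.noncentral_chisquare(d, m2, size=reps) if m2 > 0 else rng.chisquare(d, size=reps)
    return np.mean((E - alpha) * (E >= d + r2))

print(summand_mean(0.0))     # a null coordinate contributes roughly zero
print(summand_mean(4 * r2))  # a strong coordinate contributes on the order of m_j^2
```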

Let us first focus on the effect from thresholding. Let C~\widetilde{C} denote the universal constant from Lemma 13. Applying Lemma 13 yields

Ef(Tr(d))\displaystyle E_{f}(T_{r}(d)) =j=1pEf((Ej(d)αr(d))𝟙{Ej(d)d+r2})\displaystyle=\sum_{j=1}^{p}E_{f}\left(\left(E_{j}(d)-\alpha_{r}(d)\right)\mathbbm{1}_{\{E_{j}(d)\geq d+r^{2}\}}\right)
=jSfEf((Ej(d)αr(d))𝟙{Ej(d)d+r2})\displaystyle=\sum_{j\in S_{f}}E_{f}\left(\left(E_{j}(d)-\alpha_{r}(d)\right)\mathbbm{1}_{\{E_{j}(d)\geq d+r^{2}\}}\right)
jSf:mj2C~r2mj22\displaystyle\geq\sum_{j\in S_{f}:m_{j}^{2}\geq\widetilde{C}r^{2}}\frac{m_{j}^{2}}{2}
=jSfmj22jSf:mj2<C~r2mj22\displaystyle=\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}-\sum_{j\in S_{f}:m_{j}^{2}<\widetilde{C}r^{2}}\frac{m_{j}^{2}}{2}
(jSfmj22)C~sr22\displaystyle\geq\left(\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\right)-\frac{\widetilde{C}sr^{2}}{2}

where Sf[p]S_{f}\subset[p] denotes the subset of active variables jj such that Θj0\Theta_{j}\neq 0. Note that such SfS_{f} exists and |Sf|s|S_{f}|\leq s since fsf\in\mathcal{F}_{s}. We have thus bounded the amount of signal lost from thresholding.

Let us now examine the effect of truncation. Consider that

f22\displaystyle||f||_{2}^{2} jSfk=1θk,j2\displaystyle\leq\sum_{j\in S_{f}}\sum_{k=1}^{\infty}\theta_{k,j}^{2}
=jSf(kdθk,j2+k>dθk,j2)\displaystyle=\sum_{j\in S_{f}}\left(\sum_{k\leq d}\theta_{k,j}^{2}+\sum_{k>d}\theta_{k,j}^{2}\right)
jSf(kdθk,j2+μd+1k>dθk,j2μk)\displaystyle\leq\sum_{j\in S_{f}}\left(\sum_{k\leq d}\theta_{k,j}^{2}+\mu_{d+1}\sum_{k>d}\frac{\theta_{k,j}^{2}}{\mu_{k}}\right)
(jSfkdθk,j2)+sμd+1.\displaystyle\leq\left(\sum_{j\in S_{f}}\sum_{k\leq d}\theta_{k,j}^{2}\right)+s\mu_{d+1}.

Here, we have used k=1μk1θk,j21\sum_{k=1}^{\infty}\mu_{k}^{-1}\theta_{k,j}^{2}\leq 1 for all jj. Therefore, we have shown jSfkdθk,j2f22sμd+1\sum_{j\in S_{f}}\sum_{k\leq d}\theta_{k,j}^{2}\geq||f||_{2}^{2}-s\mu_{d+1}, and thus have quantified the loss due to truncation.

We are now in position to put together the two pieces. Consider f22C2τ(p,s,n)2||f||_{2}^{2}\geq C^{2}\tau(p,s,n)^{2}. Consequently,

jSfmj22C2nτ(p,s,n)2nsμd+12C2nτ(p,s,n)2nsμν+12.\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\geq\frac{C^{2}n\tau(p,s,n)^{2}-ns\mu_{d+1}}{2}\geq\frac{C^{2}n\tau(p,s,n)^{2}-ns\mu_{\nu_{\mathcal{H}}+1}}{2}.

We have used the decreasing order of the kernel’s eigenvalues, i.e. that dνd\geq\nu_{\mathcal{H}} implies μν+1μd+1\mu_{\nu_{\mathcal{H}}+1}\geq\mu_{d+1}. Further, consider that by definition of ν\nu_{\mathcal{H}} and τ(p,s,n)2\tau(p,s,n)^{2}, we have

μν+1μννlog(1+ps2)n=τ(p,s,n)2s.\mu_{\nu_{\mathcal{H}}+1}\leq\mu_{\nu_{\mathcal{H}}}\leq\frac{\sqrt{\nu_{\mathcal{H}}\log\left(1+\frac{p}{s^{2}}\right)}}{n}=\frac{\tau(p,s,n)^{2}}{s}.

With this in hand, it follows that jSfmj22(C212)nτ(p,s,n)2\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\geq\left(\frac{C^{2}-1}{2}\right)n\tau(p,s,n)^{2}. To summarize, we have shown

Ef(Tr(d))\displaystyle E_{f}(T_{r}(d)) (jSfmj22)C~sr22\displaystyle\geq\left(\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\right)-\frac{\widetilde{C}sr^{2}}{2}
(jSfmj22)C~K22Dnτ(p,s,n)22\displaystyle\geq\left(\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\right)-\frac{\widetilde{C}K_{2}^{2}\sqrt{\lceil D\rceil}n\tau(p,s,n)^{2}}{2}
(jSfmj22)(1C~K22DC21).\displaystyle\geq\left(\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\right)\left(1-\frac{\widetilde{C}K_{2}^{2}\sqrt{\lceil D\rceil}}{C^{2}-1}\right). (31)

Here, we have used dDνd\leq\lceil D\rceil\nu_{\mathcal{H}}. We can also conclude

Ef(Tr(d))(C21K22C~D2)nτ(p,s,n)2.E_{f}(T_{r}(d))\geq\left(\frac{C^{2}-1-K_{2}^{2}\widetilde{C}\sqrt{\lceil D\rceil}}{2}\right)n\tau(p,s,n)^{2}.

In view of the above bound and since C>CηC>C_{\eta}, it suffices to pick CηC_{\eta} large enough to satisfy Cη24K1Cη1+K22C~DC_{\eta}^{2}-4K_{1}C_{\eta}\geq 1+K_{2}^{2}\widetilde{C}\sqrt{\lceil D\rceil} to ensure Ef(Tr(d))2CK1nτ(p,s,n)2E_{f}(T_{r}(d))\geq 2CK_{1}n\tau(p,s,n)^{2}. The proof of the lemma is complete. ∎

Proof of Lemma 5.

To bound the variance of Tr(d)T_{r}(d), recall that log(1+ps2)K3d\sqrt{\log\left(1+\frac{p}{s^{2}}\right)}\leq K_{3}\sqrt{d}, and so 1K22log(1+ps2)K22K3dLd1\leq K_{2}^{2}\sqrt{\log\left(1+\frac{p}{s^{2}}\right)}\leq K_{2}^{2}K_{3}\sqrt{d}\leq L^{*}\sqrt{d}. Therefore, we can apply Lemma 14, which yields

Varf(Tr(d))\displaystyle\operatorname{Var}_{f}\left(T_{r}(d)\right) =j=1pVarf((Ej(d)αr(d))𝟙{Ej(d)d+r2})\displaystyle=\sum_{j=1}^{p}\operatorname{Var}_{f}\left((E_{j}(d)-\alpha_{r}(d))\mathbbm{1}_{\{E_{j}(d)\geq d+r^{2}\}}\right)
Cpr4exp(cr4d)+Csr4+CjSf:mj2>4r2mj2\displaystyle\leq C^{\dagger}pr^{4}\exp\left(-\frac{c^{**}r^{4}}{d}\right)+C^{\dagger}sr^{4}+C^{\dagger}\sum_{j\in S_{f}:m_{j}^{2}>4r^{2}}m_{j}^{2}
Cpr4exp(cr4d)+Csr4+CjSfmj2\displaystyle\leq C^{\dagger}pr^{4}\exp\left(-\frac{c^{**}r^{4}}{d}\right)+C^{\dagger}sr^{4}+C^{\dagger}\sum_{j\in S_{f}}m_{j}^{2}

where CC^{\dagger} is a positive universal constant whose value can change from instance to instance. Recall cc^{**} is defined at the beginning of the proof of Proposition 2. Since r4=K24dlog(1+ps2)r^{4}=K_{2}^{4}d\log\left(1+\frac{p}{s^{2}}\right) and cK241c^{**}K_{2}^{4}\geq 1, we have

pr4exp(cr4d)\displaystyle pr^{4}\exp\left(-\frac{c^{**}r^{4}}{d}\right) pr4exp(log(1+ps2))pr4s2s2+ps2r4.\displaystyle\leq pr^{4}\exp\left(-\log\left(1+\frac{p}{s^{2}}\right)\right)\leq pr^{4}\cdot\frac{s^{2}}{s^{2}+p}\leq s^{2}r^{4}.

Therefore,

Varf(Tr(d))Cs2r4+Csr4+CjSfmj22Cs2r4+CjSfmj2.\operatorname{Var}_{f}(T_{r}(d))\leq C^{\dagger}s^{2}r^{4}+C^{\dagger}sr^{4}+C^{\dagger}\sum_{j\in S_{f}}m_{j}^{2}\leq 2C^{\dagger}s^{2}r^{4}+C^{\dagger}\sum_{j\in S_{f}}m_{j}^{2}.

Taking CηC_{\eta} larger than a sufficiently large universal constant, noting CCηC\geq C_{\eta}, and invoking (31), we have the desired result. ∎

Proof of Proposition 3.

Our proof is largely the same as in the proof of Proposition 2, except we invoke results about the “tail” rather than the “bulk”. Fix η(0,1)\eta\in(0,1). We will make a choice of CηC_{\eta} at the end of the proof, so for now let C>CηC>C_{\eta}. We select K1K_{1} in the course of the proof, but we will select K2K_{2} now. Set K2:=1log2c1/2(cK32)1/4K_{2}:=\frac{1}{\sqrt{\log 2}}\vee c^{-1/2}\vee\left(cK_{3}^{2}\right)^{-1/4} with c=ccc=c^{*}\wedge c^{**}, where cc^{*} and cc^{**} are the universal constants in the exponential terms of Lemmas 11 and 10 respectively.

We first bound the Type I error. Since log(1+ps2)K32d\log\left(1+\frac{p}{s^{2}}\right)\geq K_{3}^{2}d, we have r2dr^{2}\gtrsim d. Since dDd\geq D, we have by Lemma 11 that for any x>0x>0,

P0{Tr(d)>C(xpr4ecr2+x)}exP_{0}\left\{T_{r}(d)>C^{*}\left(\sqrt{xpr^{4}e^{-c^{*}r^{2}}}+x\right)\right\}\leq e^{-x}

where CC^{*} is a universal positive constant. Taking x=Cx=C and noting C>1C>1 provided we have chosen Cη1C_{\eta}\geq 1, we see that

C(xpr4ecr2+x)\displaystyle C^{*}\left(\sqrt{xpr^{4}e^{-c^{*}r^{2}}}+x\right) CC(K22log(1+ps2)pecK22log(1+ps2)+1)\displaystyle\leq C^{*}C\left(K_{2}^{2}\log\left(1+\frac{p}{s^{2}}\right)\sqrt{pe^{-c^{*}K_{2}^{2}\log\left(1+\frac{p}{s^{2}}\right)}}+1\right)
CC(K22log(1+ps2)ps2s2+p+1)\displaystyle\leq C^{*}C\left(K_{2}^{2}\log\left(1+\frac{p}{s^{2}}\right)\sqrt{p\cdot\frac{s^{2}}{s^{2}+p}}+1\right)
2CCK22slog(1+ps2)\displaystyle\leq 2C^{*}CK_{2}^{2}s\log\left(1+\frac{p}{s^{2}}\right)
=CK1nτ(p,s,n)2\displaystyle=CK_{1}n\tau(p,s,n)^{2}

where we have used that cK221c^{*}K_{2}^{2}\geq 1 and we have set K1:=2CK22K_{1}:=2C^{*}K_{2}^{2}. Thus, with these choices of K1,K2,K_{1},K_{2}, and xx we have

P0{Tr(d)>CK1nτ(p,s,n)2}ex=eCeCηη2P_{0}\left\{T_{r}(d)>CK_{1}n\tau(p,s,n)^{2}\right\}\leq e^{-x}=e^{-C}\leq e^{-C_{\eta}}\leq\frac{\eta}{2}

provided we select Cηlog(2η)1C_{\eta}\geq\log\left(\frac{2}{\eta}\right)\vee 1.

We now examine the Type II error. To bound the Type II error, we will use Chebyshev’s inequality. In particular, consider that for any fsf\in\mathcal{F}_{s} with f2Cτ(p,s,n)||f||_{2}\geq C\tau(p,s,n), we have

Pf{Tr(d)CK1nτ(p,s,n)2}\displaystyle P_{f}\left\{T_{r}(d)\leq CK_{1}n\tau(p,s,n)^{2}\right\} =Pf{Ef(Tr(d))CK1nτ(p,s,n)2Ef(Tr(d))Tr(d)}\displaystyle=P_{f}\left\{E_{f}(T_{r}(d))-CK_{1}n\tau(p,s,n)^{2}\leq E_{f}(T_{r}(d))-T_{r}(d)\right\}
Varf(Tr(d))(Ef(Tr(d))CK1nτ(p,s,n)2)2\displaystyle\leq\frac{\operatorname{Var}_{f}\left(T_{r}(d)\right)}{\left(E_{f}(T_{r}(d))-CK_{1}n\tau(p,s,n)^{2}\right)^{2}} (32)

provided that Ef(Tr(d))CK1nτ(p,s,n)2E_{f}(T_{r}(d))\geq CK_{1}n\tau(p,s,n)^{2}. To ensure this application of Chebyshev’s inequality is valid and to bound the Type II error, we will need a lower bound on the expectation of Tr(d)T_{r}(d). We will also need an upper bound on the variance of Tr(d)T_{r}(d) in order to bound the Type II error. The following lemmas provide us with the requisite bounds; they are analogous to Lemmas 4 and 5 but are now in the context of the tail regime.

Lemma 6.

If CηC_{\eta} is larger than some sufficiently large universal constant, then Ef(Tr(d))2CK1nτ(p,s,n)2E_{f}(T_{r}(d))\geq 2CK_{1}n\tau(p,s,n)^{2}.

Lemma 7.

If CηC_{\eta} is larger than some sufficiently large universal constant, then Varf(Tr(d))C(s2r4+Ef(Tr(d)))\operatorname{Var}_{f}(T_{r}(d))\leq C^{{\dagger}}(s^{2}r^{4}+E_{f}(T_{r}(d))) where C>0C^{\dagger}>0 is a universal constant.

With these bounds in hand, the argument in the proof of Proposition 2 can be essentially repeated to establish that

P0{Tr(d)>CK1nτ(p,s,n)2}+supfs,f2Cτ(p,s,n)Pf{Tr(d)CK1nτ(p,s,n)2}ηP_{0}\left\{T_{r}(d)>CK_{1}n\tau(p,s,n)^{2}\right\}+\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}P_{f}\left\{T_{r}(d)\leq CK_{1}n\tau(p,s,n)^{2}\right\}\leq\eta

provided Cηlog(2η)1C_{\eta}\geq\log\left(\frac{2}{\eta}\right)\vee 1 and CηC_{\eta} is sufficiently large to satisfy Lemmas 6 and 7. We omit the details for brevity. ∎

It remains to prove Lemmas 6 and 7.

Proof of Lemma 6.

The proof is similar in style to the proof of Lemma 4, except now results for the tail regime are invoked. Letting C~\widetilde{C} denote the universal constant from Lemma 9, applying Lemma 9, and arguing similarly to the proof of Lemma 4, we obtain

Ef(Tr(d))(jSfmj22)C~sr22E_{f}\left(T_{r}(d)\right)\geq\left(\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\right)-\frac{\widetilde{C}sr^{2}}{2}

where Sf[p]S_{f}\subset[p] denotes the subset of active variables jj such that Θj0\Theta_{j}\neq 0. Note that such SfS_{f} exists and |Sf|s|S_{f}|\leq s since fsf\in\mathcal{F}_{s}. Further arguing as in the proof of Lemma 4 and using log(1+ps2)K3d\sqrt{\log\left(1+\frac{p}{s^{2}}\right)}\geq K_{3}\sqrt{d}, we have

jSfmj22(C21K32)nτ(p,s,n)2.\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\geq\left(\frac{C^{2}-\frac{1}{K_{3}}}{2}\right)n\tau(p,s,n)^{2}. (33)

To summarize, we have shown

Ef(Tr(d))\displaystyle E_{f}\left(T_{r}(d)\right) (jSfmj22)C~sr22\displaystyle\geq\left(\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\right)-\frac{\widetilde{C}sr^{2}}{2}
(jSfmj22)C~K22nτ(p,s,n)22\displaystyle\geq\left(\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\right)-\frac{\widetilde{C}K_{2}^{2}n\tau(p,s,n)^{2}}{2}
(jSfmj22)(1C~K22C21K3).\displaystyle\geq\left(\sum_{j\in S_{f}}\frac{m_{j}^{2}}{2}\right)\left(1-\frac{\widetilde{C}K_{2}^{2}}{C^{2}-\frac{1}{K_{3}}}\right). (34)

Note that we also can conclude

Ef(Tr(d))(C21K3K22C~2)nτ(p,s,n)2.E_{f}\left(T_{r}(d)\right)\geq\left(\frac{C^{2}-\frac{1}{K_{3}}-K_{2}^{2}\widetilde{C}}{2}\right)n\tau(p,s,n)^{2}.

Since C>CηC>C_{\eta}, it suffices to pick CηC_{\eta} large enough to satisfy Cη24K1Cη1K3+K22C~C_{\eta}^{2}-4K_{1}C_{\eta}\geq\frac{1}{K_{3}}+K_{2}^{2}\widetilde{C} to ensure Ef(Tr(d))2CK1nτ(p,s,n)2E_{f}(T_{r}(d))\geq 2CK_{1}n\tau(p,s,n)^{2}. The proof of the lemma is complete. ∎

Proof of Lemma 7.

Recall that log(1+ps2)K32d\log\left(1+\frac{p}{s^{2}}\right)\geq K_{3}^{2}d. By definition of r2r^{2} we have r2K22K32dr^{2}\geq K_{2}^{2}K_{3}^{2}d. In other words r2dr^{2}\gtrsim d, and so we can apply Lemma 10, which yields

Varf(Tr(d))\displaystyle\operatorname{Var}_{f}\left(T_{r}(d)\right) =j=1pVarf((Ej(d)αr(d))𝟙{Ej(d)d+r2})\displaystyle=\sum_{j=1}^{p}\operatorname{Var}_{f}\left(\left(E_{j}(d)-\alpha_{r}(d)\right)\mathbbm{1}_{\{E_{j}(d)\geq d+r^{2}\}}\right)
Cpr4exp(cmin(r4d,r2))+Csr4+CjSf:mj2>4r2mj2\displaystyle\leq C^{{\dagger}}pr^{4}\exp\left(-c^{**}\min\left(\frac{r^{4}}{d},r^{2}\right)\right)+C^{\dagger}sr^{4}+C^{\dagger}\sum_{j\in S_{f}:m_{j}^{2}>4r^{2}}m_{j}^{2}
Cpr4exp(cmin(r4d,r2))+Csr4+CjSfmj2.\displaystyle\leq C^{{\dagger}}pr^{4}\exp\left(-c^{**}\min\left(\frac{r^{4}}{d},r^{2}\right)\right)+C^{\dagger}sr^{4}+C^{\dagger}\sum_{j\in S_{f}}m_{j}^{2}.

where CC^{{\dagger}} is a positive universal constant. Recall we had defined cc^{**} at the beginning of the proof of Proposition 3. Since r2K22K32dr^{2}\geq K_{2}^{2}K_{3}^{2}d, we have r4dr2K22K32\frac{r^{4}}{d}\geq r^{2}K_{2}^{2}K_{3}^{2}. Therefore,

exp(cmin(r4d,r2))exp(c(K22K321)r2)exp(log(1+ps2))s2s2+p\displaystyle\exp\left(-c^{**}\min\left(\frac{r^{4}}{d},r^{2}\right)\right)\leq\exp\left(-c^{**}\left(K_{2}^{2}K_{3}^{2}\wedge 1\right)r^{2}\right)\leq\exp\left(-\log\left(1+\frac{p}{s^{2}}\right)\right)\leq\frac{s^{2}}{s^{2}+p}

where we have used that c(K22K321)K221c^{**}\left(K_{2}^{2}K_{3}^{2}\wedge 1\right)K_{2}^{2}\geq 1 by definition of K2K_{2}. Therefore,

Varf(Tr(d))2Cs2r4+CjSfmj2.\operatorname{Var}_{f}\left(T_{r}(d)\right)\leq 2C^{\dagger}s^{2}r^{4}+C^{\dagger}\sum_{j\in S_{f}}m_{j}^{2}.

Taking CηC_{\eta} larger than a sufficiently large universal constant and invoking (34) yields the desired result. ∎

7.1.2 Dense

Proof of Proposition 4.

For ease of notation, set L=2ηL=\frac{2}{\sqrt{\eta}}. Define Cη:=(22L+1)(1+4η+L)(642η+1)C_{\eta}:=\left(\sqrt{2\sqrt{2}L+1}\right)\vee\left(\sqrt{1+\frac{4}{\sqrt{\eta}}+L}\right)\vee\left(\sqrt{\frac{64\sqrt{2}}{\eta}+1}\right). Let C>CηC>C_{\eta}. We first bound the Type I error. Consider that under P0P_{0} we have Tχpν2T\sim\chi^{2}_{p\nu_{\mathcal{H}}}. Observe that E0(T)=pνE_{0}\left(T\right)=p\nu_{\mathcal{H}} and Var0(T)=2pν\operatorname{Var}_{0}\left(T\right)=2p\nu_{\mathcal{H}}. Consequently, we have by Chebyshev’s inequality

P0{T>pν+Lpν}Var0(T)L2pν=2L2η2.P_{0}\left\{T>p\nu_{\mathcal{H}}+L\sqrt{p\nu_{\mathcal{H}}}\right\}\leq\frac{\operatorname{Var}_{0}(T)}{L^{2}p\nu_{\mathcal{H}}}=\frac{2}{L^{2}}\leq\frac{\eta}{2}.
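As a quick sanity check with illustrative numbers, this Chebyshev bound can be verified by simulation; the empirical tail is in fact far smaller than η/2\eta/2, since Chebyshev ignores the rapid decay of the χ2\chi^{2} tail.

```python
import numpy as np

rng = np.random.default_rng(4)
df, eta = 50_000, 0.1  # df plays the role of p * nu_H; values illustrative
L = 2 / np.sqrt(eta)
T = rng.chisquare(df, size=100_000)
print(np.mean(T > df + L * np.sqrt(df)), "vs. Chebyshev bound", eta / 2)
```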

We now examine the Type II error. Under PfP_{f}, we have Tχpν2(nj=1pkνθk,j2)T\sim\chi^{2}_{p\nu_{\mathcal{H}}}\left(n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}\right) where {θk,j}\{\theta_{k,j}\} denote the basis coefficients associated to ff. Note Ef(T)=pν+nj=1pkνθk,j2E_{f}(T)=p\nu_{\mathcal{H}}+n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2} and Varf(T)=2pν+4nj=1pkνθk,j2\operatorname{Var}_{f}(T)=2p\nu_{\mathcal{H}}+4n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}. In order to proceed with the argument, we need to obtain a lower bound estimate for the signal strength. Letting SS denote the set of jj for which Θj\Theta_{j} is nonzero, consider that for f22C2τ(p,s,n)2||f||_{2}^{2}\geq C^{2}\tau(p,s,n)^{2} we have

C2τ(p,s,n)2\displaystyle C^{2}\tau(p,s,n)^{2} f22\displaystyle\leq||f||_{2}^{2}
=j=1pk=1θk,j2\displaystyle=\sum_{j=1}^{p}\sum_{k=1}^{\infty}\theta_{k,j}^{2}
jS(kνθk,j2+k>νθk,j2)\displaystyle\leq\sum_{j\in S}\left(\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}+\sum_{k>\nu_{\mathcal{H}}}\theta_{k,j}^{2}\right)
jS(kνθk,j2+μνk>νθk,j2μk)\displaystyle\leq\sum_{j\in S}\left(\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}+\mu_{\nu_{\mathcal{H}}}\sum_{k>\nu_{\mathcal{H}}}\frac{\theta_{k,j}^{2}}{\mu_{k}}\right)
(jSkνθk,j2)+sμν.\displaystyle\leq\left(\sum_{j\in S}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}\right)+s\mu_{\nu_{\mathcal{H}}}.

We have used k=1θk,j2μk1\sum_{k=1}^{\infty}\frac{\theta_{k,j}^{2}}{\mu_{k}}\leq 1 for all jj in the final line. Therefore by definition of ν\nu_{\mathcal{H}} we have

njSkνθk,j2nC2τ(p,s,n)2nsμν(Cη21)νp\displaystyle n\sum_{j\in S}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}\geq nC^{2}\tau(p,s,n)^{2}-ns\mu_{\nu_{\mathcal{H}}}\geq\left(C_{\eta}^{2}-1\right)\sqrt{\nu_{\mathcal{H}}p}

where we have used that nsμνsνlog(1+ps2)pνns\mu_{\nu_{\mathcal{H}}}\leq s\sqrt{\nu_{\mathcal{H}}\log\left(1+\frac{p}{s^{2}}\right)}\leq\sqrt{p\nu_{\mathcal{H}}}.

We now continue with bounding the Type II error. By Chebyshev’s inequality, we have

\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}P_{f}\left\{T\leq p\nu_{\mathcal{H}}+L\sqrt{p\nu_{\mathcal{H}}}\right\}
=\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}P_{f}\left\{E_{f}(T)-p\nu_{\mathcal{H}}-L\sqrt{p\nu_{\mathcal{H}}}\leq E_{f}(T)-T\right\}
\leq\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}\frac{\operatorname{Var}_{f}\left(T\right)}{\left(E_{f}(T)-p\nu_{\mathcal{H}}-L\sqrt{p\nu_{\mathcal{H}}}\right)^{2}}
\leq\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}\frac{2p\nu_{\mathcal{H}}+4n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}}{\left(n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}-L\sqrt{p\nu_{\mathcal{H}}}\right)^{2}}
\leq\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}\frac{2p\nu_{\mathcal{H}}}{\left(C_{\eta}^{2}-1-L\right)^{2}\nu_{\mathcal{H}}p}+\frac{4n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}}{\left(n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}-L\sqrt{p\nu_{\mathcal{H}}}\right)^{2}}
\leq\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}\frac{2}{\left(C_{\eta}^{2}-1-L\right)^{2}}+\frac{4n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}}{\left(\frac{1}{2}n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}\right)^{2}}
=\sup_{\begin{subarray}{c}f\in\mathcal{F}_{s},\\ ||f||_{2}\geq C\tau(p,s,n)\end{subarray}}\frac{2}{\left(C_{\eta}^{2}-1-L\right)^{2}}+\frac{16}{n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}}
\leq\frac{2}{\left(C_{\eta}^{2}-1-L\right)^{2}}+\frac{16}{(C_{\eta}^{2}-1)\sqrt{\nu_{\mathcal{H}}p}}
\leq\frac{2}{\left(C_{\eta}^{2}-1-L\right)^{2}}+\frac{16}{C_{\eta}^{2}-1}
\leq\frac{\eta}{4}+\frac{\eta}{4}
=\frac{\eta}{2}.

In the fourth through sixth lines we have used n\sum_{j=1}^{p}\sum_{k\leq\nu_{\mathcal{H}}}\theta_{k,j}^{2}\geq(C_{\eta}^{2}-1)\sqrt{\nu_{\mathcal{H}}p}\geq 2L\sqrt{p\nu_{\mathcal{H}}}, which follows from the signal strength bound established above together with C_{\eta}^{2}\geq 1+\frac{4}{\sqrt{\eta}}+L. Therefore, the sum of Type I and Type II errors is bounded by \eta as desired. ∎
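Before turning to the lower bounds, the Chebyshev calibration above can be sanity-checked numerically. The following is a minimal Monte Carlo sketch in Python (numpy and scipy assumed); the values of p, \nu_{\mathcal{H}}, \eta, and the noncentrality are arbitrary stand-ins chosen for illustration, not quantities prescribed by the result.

# Monte Carlo sanity check of the Chebyshev calibration in the proof above.
# All numerical values are illustrative stand-ins.
import numpy as np
from scipy.stats import chi2, ncx2

rng = np.random.default_rng(0)
eta = 0.05
p, nu = 500, 8                        # stand-ins for p and nu_H
d = p * nu                            # degrees of freedom of T under the null
L = 2 / np.sqrt(eta)
thresh = d + L * np.sqrt(d)           # threshold p*nu_H + L*sqrt(p*nu_H)

T0 = chi2.rvs(df=d, size=100_000, random_state=rng)
print(f"empirical Type I error: {(T0 > thresh).mean():.4f} (bound eta/2 = {eta / 2})")

# Under P_f the statistic is noncentral chi^2; place the noncentrality at the
# proof's lower bound (C_eta^2 - 1) * sqrt(nu_H * p), with C_eta^2 ~ 64*sqrt(2)/eta.
lam = (64 * np.sqrt(2) / eta) * np.sqrt(d)
T1 = ncx2.rvs(df=d, nc=lam, size=100_000, random_state=rng)
print(f"empirical Type II error: {(T1 <= thresh).mean():.4f} (bound eta/2 = {eta / 2})")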

7.2 Minimax lower bounds

Proof of Proposition 1.

We break up the analysis into two cases.

Case 1: Suppose s<ps<\sqrt{p}. We will construct a prior distribution π\pi on 𝒯s\mathscr{T}_{s} and use Le Cam’s two point method to furnish a lower bound. Define cη:=1κκlog(1+4η2)c_{\eta}:=1\wedge\sqrt{\kappa}\wedge\sqrt{\kappa\log\left(1+4\eta^{2}\right)}. Let 0<c<cη0<c<c_{\eta}. Let π\pi be the prior in which a draw Θπ\Theta\sim\pi is constructed by uniformly drawing S[p]S\subset[p] of size ss and setting

Θj={ce1if jS,0otherwise\Theta_{j}=\begin{cases}ce_{1}&\textit{if }j\in S,\\ 0&\textit{otherwise}\end{cases}

where e_{1}\in\mathbb{R}^{\mathbb{N}} is given by e_{1}=(1,0,0,...). Note that \Theta\sim\pi implies ||\Theta||_{F}^{2}=c^{2}s and \Theta\in\mathscr{T}_{s}. By the Neyman-Pearson lemma and the inequality d_{TV}(Q,P)\leq\sqrt{\chi^{2}(Q\,||\,P)}/2, we have

infφ{P0{φ0}+supΘ𝒯s,ΘFcsPΘ{φ1}}112χ2(Pπ||P0)\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\sup_{\begin{subarray}{c}\Theta\in\mathscr{T}_{s},\\ ||\Theta||_{F}\geq c\sqrt{s}\end{subarray}}P_{\Theta}\left\{\varphi\neq 1\right\}\right\}\geq 1-\frac{1}{2}\sqrt{\chi^{2}(P_{\pi}||P_{0})} (35)

where PπP_{\pi} denotes the mixture PΘπ(dΘ)\int P_{\Theta}\,\pi(d\Theta) induced by π\pi. By the Ingster-Suslina method (Proposition 5) and Corollary 3, we have

χ2(Pπ||P0)\displaystyle\chi^{2}(P_{\pi}||P_{0}) =E(exp(nΘ,Θ~F))1\displaystyle=E\left(\exp\left(n\langle\Theta,\widetilde{\Theta}\rangle_{F}\right)\right)-1
=E(exp(c2n|SS~|))1\displaystyle=E\left(\exp\left(c^{2}n|S\cap\widetilde{S}|\right)\right)-1
(1sp+spec2n)s1.\displaystyle\leq\left(1-\frac{s}{p}+\frac{s}{p}e^{c^{2}n}\right)^{s}-1.

Here, \Theta,\widetilde{\Theta}\overset{iid}{\sim}\pi and S,\widetilde{S} are the corresponding random support sets. Now, using that \log\left(1+\frac{p}{s^{2}}\right)\geq\kappa n, we have

\chi^{2}(P_{\pi}\,||\,P_{0})\leq\left(1-\frac{s}{p}+\frac{s}{p}e^{c^{2}\kappa^{-1}\log\left(1+\frac{p}{s^{2}}\right)}\right)^{s}-1
=(1sp+sp(1+ps2)c2κ1)s1\displaystyle=\left(1-\frac{s}{p}+\frac{s}{p}\left(1+\frac{p}{s^{2}}\right)^{c^{2}\kappa^{-1}}\right)^{s}-1
(1sp+sp+c2κ1s)s1\displaystyle\leq\left(1-\frac{s}{p}+\frac{s}{p}+\frac{c^{2}\kappa^{-1}}{s}\right)^{s}-1
exp(c2κ)1\displaystyle\leq\exp\left(\frac{c^{2}}{\kappa}\right)-1
\leq 4\eta^{2}.

We have also used that c2κ1<cη2κ11c^{2}\kappa^{-1}<c_{\eta}^{2}\kappa^{-1}\leq 1 and the inequality (1+x)y1+xy(1+x)^{y}\leq 1+xy for x>0x>0 and y(0,1)y\in(0,1). Plugging into (35) yields the desired result.

Case 2: Suppose sps\geq\sqrt{p}. We can repeat the analysis done in Case 1 with the modification of replacing every instance of ss with p\lceil\sqrt{p}\rceil. ∎
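As a numerical illustration of how the calibration c<c_{\eta} keeps the divergence small, the final bound can be evaluated directly. The following Python sketch uses arbitrary stand-in values of \eta, \kappa, p, and s, with n taken at the boundary \log\left(1+\frac{p}{s^{2}}\right)=\kappa n.

# Evaluate the two-point chi^2-divergence bound from the proof of Proposition 1.
# eta, kappa, p, s are arbitrary illustrative values.
import numpy as np

eta, kappa = 0.05, 1.0
p, s = 10**6, 10
c_eta = min(1.0, np.sqrt(kappa), np.sqrt(kappa * np.log(1 + 4 * eta**2)))
c = 0.99 * c_eta                      # any 0 < c < c_eta is admissible

# the bound after substituting e^{c^2 n} <= (1 + p/s^2)^{c^2 / kappa}:
bound = (1 - s / p + (s / p) * (1 + p / s**2) ** (c**2 / kappa)) ** s - 1
print(f"chi^2 divergence bound: {bound:.3e}  vs  4*eta^2 = {4 * eta**2:.3e}")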

Proof of Theorem 1.

We will construct a prior distribution π\pi on 𝒯s\mathscr{T}_{s} and use Le Cam’s two point method to furnish a lower bound. Set cη:=121/4(log(1+4η2))1/4c_{\eta}:=1\wedge 2^{1/4}\wedge\left(\log\left(1+4\eta^{2}\right)\right)^{1/4}. Let 0<c<cη0<c<c_{\eta}. We break up the analysis into two cases.

Case 1: Suppose we are in the regime where ψ(p,s,n)2=sΓ\psi(p,s,n)^{2}=s\Gamma_{\mathcal{H}}. Set ρ:=Γν\rho:=\sqrt{\frac{\Gamma_{\mathcal{H}}}{\nu_{\mathcal{H}}}}. Let π\pi be the prior in which a draw Θπ\Theta\sim\pi is obtained by uniformly drawing S[p]S\subset[p] of size ss and drawing

θi,k{Uniform{cρ,cρ}if iS and k<ν,δ0otherwise\theta_{i,k}\sim\begin{cases}\operatorname{Uniform}\{-c\rho,c\rho\}&\textit{if }i\in S\text{ and }k<\nu_{\mathcal{H}},\\ \delta_{0}&\textit{otherwise}\end{cases}

where δ0\delta_{0} denotes the probability measure placing full mass at zero. Note that Θπ\Theta\sim\pi implies ΘF2=c2ρ2s(ν1)=c2sΓν1νc22sΓ||\Theta||_{F}^{2}=c^{2}\rho^{2}s(\nu_{\mathcal{H}}-1)=c^{2}s\Gamma_{\mathcal{H}}\cdot\frac{\nu_{\mathcal{H}}-1}{\nu_{\mathcal{H}}}\geq\frac{c^{2}}{2}s\Gamma_{\mathcal{H}}. Here, we have used that ν2\nu_{\mathcal{H}}\geq 2 since log(1+ps2)n2\log\left(1+\frac{p}{s^{2}}\right)\leq\frac{n}{2}. Furthermore, consider that for iSi\in S, we have

=1θi,2μ=c2ρ2<ν1μ=c21ν<νΓμc21ν<νμν1μc2cη21.\displaystyle\sum_{\ell=1}^{\infty}\frac{\theta_{i,\ell}^{2}}{\mu_{\ell}}=c^{2}\rho^{2}\sum_{\ell<\nu_{\mathcal{H}}}\frac{1}{\mu_{\ell}}=c^{2}\frac{1}{\nu_{\mathcal{H}}}\sum_{\ell<\nu_{\mathcal{H}}}\frac{\Gamma_{\mathcal{H}}}{\mu_{\ell}}\leq c^{2}\frac{1}{\nu_{\mathcal{H}}}\sum_{\ell<\nu_{\mathcal{H}}}\frac{\mu_{\nu_{\mathcal{H}}-1}}{\mu_{\ell}}\leq c^{2}\leq c_{\eta}^{2}\leq 1.

Here, we have used the ordering of the eigenvalues μ1μ20\mu_{1}\geq\mu_{2}\geq...\geq 0 and Γμν1\Gamma_{\mathcal{H}}\leq\mu_{\nu_{\mathcal{H}}-1} by Lemma 25. Of course, for iSi\not\in S we have Θi=0\Theta_{i}=0. Hence we have Θ𝒯s\Theta\in\mathscr{T}_{s}. We then have by the Neyman-Pearson lemma and the inequality 1dTV(Q,P)1χ2(Q||P)/21-d_{TV}(Q,P)\geq 1-\sqrt{\chi^{2}(Q||P)}/2 that

infφ{P0{φ0}+supΘ𝒯s,ΘFcψ(p,s,n)PΘ{φ1}}112χ2(Pπ||P0).\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\sup_{\begin{subarray}{c}\Theta\in\mathscr{T}_{s},\\ ||\Theta||_{F}\geq c\psi(p,s,n)\end{subarray}}P_{\Theta}\left\{\varphi\neq 1\right\}\right\}\geq 1-\frac{1}{2}\sqrt{\chi^{2}(P_{\pi}||P_{0})}. (36)

For the following calculations, let Θ,Θiidπ\Theta,\Theta^{\prime}\overset{iid}{\sim}\pi and S,SS,S^{\prime} are the corresponding random sets. Also, let {ri,ri}1ip,\left\{r_{i\ell},r^{\prime}_{i\ell}\right\}_{1\leq i\leq p,\ell\in\mathbb{N}} denote an iid collection of Rademacher(1/2)\operatorname{Rademacher}\left(1/2\right) random variables. By the Ingster-Suslina method (Proposition 5), independence of {S,S}\{S,S^{\prime}\} with {ri,ri}1ip,\left\{r_{i\ell},r^{\prime}_{i\ell}\right\}_{1\leq i\leq p,\ell\in\mathbb{N}}, and Corollary 3, we have

\chi^{2}(P_{\pi}\,||\,P_{0}) =E\left(\exp\left(n\langle\Theta,\Theta^{\prime}\rangle_{F}\right)\right)-1
=E\left(\exp\left(nc^{2}\rho^{2}\sum_{i\in S\cap S^{\prime}}\sum_{\ell<\nu_{\mathcal{H}}}r_{i\ell}r^{\prime}_{i\ell}\right)\right)-1
=E\left(\prod_{i\in S\cap S^{\prime}}E\left(\exp\left(nc^{2}\rho^{2}\sum_{\ell<\nu_{\mathcal{H}}}r_{i\ell}r^{\prime}_{i\ell}\right)\right)\right)-1
=E\left(\cosh\left(nc^{2}\rho^{2}\right)^{(\nu_{\mathcal{H}}-1)|S\cap S^{\prime}|}\right)-1
\leq E\left(\exp\left(\frac{c^{4}n^{2}\rho^{4}\nu_{\mathcal{H}}}{2}|S\cap S^{\prime}|\right)\right)-1
\leq\left(1-\frac{s}{p}+\frac{s}{p}e^{\frac{c^{4}n^{2}\rho^{4}\nu_{\mathcal{H}}}{2}}\right)^{s}-1.

Consider by Lemma 1 that c4n2ρ4ν=c4n2Γ2νc4log(1+ps2)c^{4}n^{2}\rho^{4}\nu_{\mathcal{H}}=\frac{c^{4}n^{2}\Gamma_{\mathcal{H}}^{2}}{\nu_{\mathcal{H}}}\leq c^{4}\log\left(1+\frac{p}{s^{2}}\right). Therefore,

\left(1-\frac{s}{p}+\frac{s}{p}e^{\frac{c^{4}n^{2}\rho^{4}\nu_{\mathcal{H}}}{2}}\right)^{s}-1\leq\left(1-\frac{s}{p}+\frac{s}{p}e^{\frac{c^{4}}{2}\log\left(1+\frac{p}{s^{2}}\right)}\right)^{s}-1\leq\left(1+\frac{c^{4}}{2s}\right)^{s}-1\leq e^{c^{4}/2}-1\leq e^{c_{\eta}^{4}/2}-1\leq 4\eta^{2}.

We have used c42<cη421\frac{c^{4}}{2}<\frac{c_{\eta}^{4}}{2}\leq 1 and the inequality (1+x)y1+xy(1+x)^{y}\leq 1+xy for x>0x>0 and y(0,1)y\in(0,1). Using χ2(Pπ||P0)4η2\chi^{2}(P_{\pi}||P_{0})\leq 4\eta^{2} with (36) yields the desired result.
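For the reader's convenience, the two elementary facts behind the fourth and fifth lines of the preceding \chi^{2} computation are, for independent \operatorname{Rademacher}(1/2) variables r,r^{\prime} and \lambda\in\mathbb{R},

E\left(e^{\lambda rr^{\prime}}\right)=\frac{1}{2}e^{\lambda}+\frac{1}{2}e^{-\lambda}=\cosh(\lambda)\qquad\text{and}\qquad\cosh(\lambda)=\sum_{m=0}^{\infty}\frac{\lambda^{2m}}{(2m)!}\leq\sum_{m=0}^{\infty}\frac{(\lambda^{2}/2)^{m}}{m!}=e^{\lambda^{2}/2},

where the inequality uses (2m)!\geq 2^{m}m!.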

Case 2: Suppose s<ps<\sqrt{p} and ψ(p,s,n)2=snlog(1+ps2)\psi(p,s,n)^{2}=\frac{s}{n}\log\left(1+\frac{p}{s^{2}}\right). Set ρ=1\rho=1. Let π\pi be the prior in which a draw Θπ\Theta\sim\pi is obtained by uniformly drawing S[p]S\subset[p] of size ss and drawing

θi,k{Uniform{cρ,cρ}if iS and k=1,δ0otherwise.\theta_{i,k}\sim\begin{cases}\operatorname{Uniform}\{-c\rho,c\rho\}&\textit{if }i\in S\text{ and }k=1,\\ \delta_{0}&\textit{otherwise}.\end{cases}

The desired result can be proved by arguing in a manner similar to Case 1 and to the proof of Proposition 1 (see also [10]). Details are omitted for the sake of brevity. ∎

7.3 Adaptive lower bound

Proof of Theorem 2.

Recall the definitions of 𝒜\mathscr{A}_{\mathcal{H}} in (15) and 𝒱~\widetilde{\mathscr{V}}_{\mathcal{H}} in (20). By assumption, we have 𝒜Llog(e|𝒱~|)\mathscr{A}_{\mathcal{H}}\leq L\log(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|) for a universal constant L>0L>0. For ease of notation, let us denote ψ=ψadapt(p,s,n)\psi=\psi_{\text{adapt}}(p,s,n) where ψadapt\psi_{\text{adapt}} is given by (19). Set cη:=(16L)1/4(η2log(1+2η2)32L)1/4c_{\eta}:=(16L)^{-1/4}\wedge\left(\frac{\eta^{2}\log\left(1+2\eta^{2}\right)}{32L}\right)^{1/4} and let 0<c<cη0<c<c_{\eta}. We now define a prior π\pi. A draw Θπ\Theta\sim\pi is obtained as follows. First, draw vUniform(𝒱~)v\sim\text{Uniform}(\widetilde{\mathscr{V}}_{\mathcal{H}}). Set s:=min{s[p]:sp𝒜 and v2<ν(s,𝒜)v}s:=\min\left\{s^{\prime}\in[p]:s^{\prime}\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}\text{ and }\frac{v}{2}<\nu_{\mathcal{H}}(s^{\prime},\mathscr{A}_{\mathcal{H}})\leq v\right\}. Then draw uniformly at random a subset S[p]S\subset[p] of size exactly ss. Let ds:=2ks𝒱~d_{s}:=2^{k_{s}}\in\widetilde{\mathscr{V}}_{\mathcal{H}} satisfy 2ks1<ν(s,𝒜)2ks2^{k_{s}-1}<\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})\leq 2^{k_{s}}. Draw independently

θk,j{Uniform{2cρs,2cρs}if jS and 1k<ν(s,𝒜),δ0otherwise.\theta_{k,j}\sim\begin{cases}\operatorname{Uniform}\left\{-\sqrt{2}c\rho_{s},\sqrt{2}c\rho_{s}\right\}&\textit{if }j\in S\text{ and }1\leq k<\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}}),\\ \delta_{0}&\textit{otherwise}.\end{cases}

Here, ρs=ψ2/sν(s,𝒜)\rho_{s}=\sqrt{\frac{\psi^{2}/s}{\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})}}. This description concludes the definition of π\pi.

Now we must show that π\pi is indeed supported on s𝒯s\bigcup_{s}\mathscr{T}_{s}. Note that when we draw Θπ\Theta\sim\pi, the associated sparsity level ss always satisfies sp𝒜s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}. Furthermore, conditional on ss we have

ΘF2=2c2ψ2ν(s,𝒜)(ν(s,𝒜)1)c2ψ2.||\Theta||_{F}^{2}=2c^{2}\frac{\psi^{2}}{\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})}(\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})-1)\geq c^{2}\psi^{2}.

Furthermore, consider that for jSj\in S we have

\sum_{\ell=1}^{\infty}\frac{\theta_{\ell,j}^{2}}{\mu_{\ell}} \leq 2c^{2}\rho_{s}^{2}\sum_{\ell<\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})}\frac{1}{\mu_{\ell}}
=2c^{2}\frac{1}{\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})}\sum_{\ell<\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})}\frac{\psi^{2}/s}{\mu_{\ell}}
\leq 2c^{2}\frac{1}{\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})}\sum_{\ell<\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})}\frac{\mu_{\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})-1}}{\mu_{\ell}}
\leq 2c^{2}
\leq 1

where the last line follows from c<cηc<c_{\eta}. Note we have used the ordering of the eigenvalues as well as Γ(s,𝒜)μν(s,𝒜)1\Gamma_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})\leq\mu_{\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})-1} by Lemma 26. Of course, for jSj\not\in S we have Θj=0\Theta_{j}=0. Hence, Θ𝒯s\Theta\in\mathscr{T}_{s} and so π\pi is properly supported.

Writing Pπ=PΘπ(dΘ)P_{\pi}=\int P_{\Theta}\pi(d\Theta) for the mixture, we have

infφ{P0{φ=1}+maxsp𝒜supΘ𝒯s,ΘFcψadapt(p,s,n)PΘ{φ=0}}\displaystyle\inf_{\varphi}\left\{P_{0}\left\{\varphi=1\right\}+\max_{s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}}\sup_{\begin{subarray}{c}\Theta\in\mathscr{T}_{s},\\ ||\Theta||_{F}\geq c\psi_{\text{adapt}}(p,s,n)\end{subarray}}P_{\Theta}\left\{\varphi=0\right\}\right\} 112χ2(Pπ||P0).\displaystyle\geq 1-\frac{1}{2}\sqrt{\chi^{2}(P_{\pi}\,||\,P_{0})}. (37)

For the following calculations, let Θ,Θiidπ\Theta,\Theta^{\prime}\overset{iid}{\sim}\pi. Let v,v𝒱~v,v^{\prime}\in\widetilde{\mathscr{V}}_{\mathcal{H}} be the corresponding quantities. Let ss and tt denote the corresponding sparsities. Denote the corresponding support sets SS and TT. Also, let {ri,r~i}1ip,\{r_{i\ell},\tilde{r}_{i\ell}\}_{1\leq i\leq p,\ell\in\mathbb{N}} denote an iid collection of Rademacher(1/2)\operatorname{Rademacher}\left(1/2\right) random variables which is independent of s,t,S,s,t,S, and TT. For ease of notation, let νs=ν(s,𝒜)\nu_{s}=\nu_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}}). Likewise, let νt=ν(t,𝒜)\nu_{t}=\nu_{\mathcal{H}}(t,\mathscr{A}_{\mathcal{H}}). By the Ingster-Suslina method (Proposition 5), we have

\chi^{2}\left(P_{\pi}\,||\,P_{0}\right)+1 =E\left(\exp\left(n\langle\Theta,\Theta^{\prime}\rangle_{F}\right)\right)
=E\left(\exp\left(2nc^{2}\rho_{s}\rho_{t}\sum_{i\in S\cap T}\sum_{1\leq\ell<\nu_{s}\wedge\nu_{t}}r_{i\ell}\tilde{r}_{i\ell}\right)\right)
=E\left(\prod_{i\in S\cap T}\prod_{1\leq\ell<\nu_{s}\wedge\nu_{t}}\cosh\left(2nc^{2}\rho_{s}\rho_{t}\right)\right)
\leq E\left(\exp\left(2n^{2}c^{4}\rho_{s}^{2}\rho_{t}^{2}(\nu_{s}\wedge\nu_{t})|S\cap T|\right)\right)
\leq E\left(\exp\left(2n^{2}c^{4}\rho_{s}^{2}\rho_{t}^{2}(d_{s}\wedge d_{t})|S\cap T|\right)\right).

Here, we have used the inequality cosh(x)ex2/2\cosh(x)\leq e^{x^{2}/2} for x>0x>0. Consider that by Lemma 26

2n2c4ρs2ρt2(dsdt)\displaystyle 2n^{2}c^{4}\rho_{s}^{2}\rho_{t}^{2}(d_{s}\wedge d_{t}) =2n2c4Γ(s,𝒜)νsΓ(t,𝒜)νt(dsdt)\displaystyle=2n^{2}c^{4}\cdot\frac{\Gamma_{\mathcal{H}}(s,\mathscr{A}_{\mathcal{H}})}{\nu_{s}}\frac{\Gamma_{\mathcal{H}}(t,\mathscr{A}_{\mathcal{H}})}{\nu_{t}}\cdot(d_{s}\wedge d_{t})
2c4log(1+p𝒜s2)log(1+p𝒜t2)dsdtνsνt\displaystyle\leq 2c^{4}\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{t^{2}}\right)}\cdot\frac{d_{s}\wedge d_{t}}{\sqrt{\nu_{s}\nu_{t}}}
4c4log(1+p𝒜s2)log(1+p𝒜t2)dsdtdsdt\displaystyle\leq 4c^{4}\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{t^{2}}\right)}\cdot\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}
4c4p𝒜s2p𝒜t2dsdtdsdt\displaystyle\leq 4c^{4}\sqrt{\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\cdot\frac{p\mathscr{A}_{\mathcal{H}}}{t^{2}}}\cdot\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}
\leq 8c^{4}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{st}\right)\cdot\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}.

We have used s,tp𝒜s,t\geq\sqrt{p\mathscr{A}_{\mathcal{H}}} along with the inequality u2log(1+u)u\frac{u}{2}\leq\log(1+u)\leq u for 0u10\leq u\leq 1. Therefore,

E(exp(nΘ,ΘF))E(exp(8c4log(1+p𝒜st)dsdtdsdt|ST|)).\displaystyle E\left(\exp\left(n\langle\Theta,\Theta^{\prime}\rangle_{F}\right)\right)\leq E\left(\exp\left(8c^{4}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{st}\right)\cdot\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}|S\cap T|\right)\right). (38)

Since 8c4dsdtdsdt18c^{4}\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}\leq 1, we can use Lemma 28 and the inequality (1+x)δ1+δx(1+x)^{\delta}\leq 1+\delta x for x>0x>0 and δ1\delta\leq 1 to obtain

χ2(Pπ||P0)+1\displaystyle\chi^{2}\left(P_{\pi}\,||\,P_{0}\right)+1 E(exp(8c4dsdtdsdtlog(1+p𝒜st)|ST|))\displaystyle\leq E\left(\exp\left(8c^{4}\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{st}\right)|S\cap T|\right)\right)
E((1sp+spexp(8c4log(1+p𝒜st)dsdtdsdt))t)\displaystyle\leq E\left(\left(1-\frac{s}{p}+\frac{s}{p}\exp\left(8c^{4}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{st}\right)\cdot\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}\right)\right)^{t}\right)
=E((1sp+sp(1+p𝒜st)8c4dsdtdsdt)t)\displaystyle=E\left(\left(1-\frac{s}{p}+\frac{s}{p}\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{st}\right)^{8c^{4}\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}}\right)^{t}\right)
E((1+1t8c4dsdtdsdt𝒜)t)\displaystyle\leq E\left(\left(1+\frac{1}{t}\cdot 8c^{4}\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}\mathscr{A}_{\mathcal{H}}\right)^{t}\right)
E(exp(8c4dsdtdsdt𝒜)).\displaystyle\leq E\left(\exp\left(8c^{4}\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}\mathscr{A}_{\mathcal{H}}\right)\right).

Recall we write ds=2ksd_{s}=2^{k_{s}} and dt=2ktd_{t}=2^{k_{t}}. Moreover, recall ds,dt𝒱~d_{s},d_{t}\in\widetilde{\mathscr{V}}_{\mathcal{H}}. Now observe

E(exp(8c4dsdtdsdt𝒜))=E(exp(8c42|kskt|2𝒜)).E\left(\exp\left(8c^{4}\frac{d_{s}\wedge d_{t}}{\sqrt{d_{s}d_{t}}}\mathscr{A}_{\mathcal{H}}\right)\right)=E\left(\exp\left(8c^{4}2^{-\frac{|k_{s}-k_{t}|}{2}}\mathscr{A}_{\mathcal{H}}\right)\right).

Recall we have 𝒜Llog(e|𝒱~|)\mathscr{A}_{\mathcal{H}}\leq L\log(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|). Define the sets

E1\displaystyle E_{1} :={(ks,kt)×:2ks,2kt𝒱~ and |kskt|η22e8Lc4log2(e|𝒱~|)},\displaystyle:=\left\{(k_{s},k_{t})\in\mathbb{N}\times\mathbb{N}:2^{k_{s}},2^{k_{t}}\in\widetilde{\mathscr{V}}_{\mathcal{H}}\text{ and }|k_{s}-k_{t}|\leq\frac{\eta^{2}}{2e^{8Lc^{4}}}\log_{2}(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|)\right\},
E2\displaystyle E_{2} :={(ks,kt)×:2ks,2kt𝒱~ and |kskt|>η22e8Lc4log2(e|𝒱~|)}.\displaystyle:=\left\{(k_{s},k_{t})\in\mathbb{N}\times\mathbb{N}:2^{k_{s}},2^{k_{t}}\in\widetilde{\mathscr{V}}_{\mathcal{H}}\text{ and }|k_{s}-k_{t}|>\frac{\eta^{2}}{2e^{8Lc^{4}}}\log_{2}(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|)\right\}.

Then

E(exp(8c42|kskt|2𝒜))\displaystyle E\left(\exp\left(8c^{4}2^{-\frac{|k_{s}-k_{t}|}{2}}\mathscr{A}_{\mathcal{H}}\right)\right)
=E(exp(8c42|kskt|2𝒜)𝟙{(ks,kt)E1})+E(exp(8c42|kskt|2𝒜)𝟙{(ks,kt)E2}).\displaystyle=E\left(\exp\left(8c^{4}2^{-\frac{|k_{s}-k_{t}|}{2}}\mathscr{A}_{\mathcal{H}}\right)\mathbbm{1}_{\{(k_{s},k_{t})\in E_{1}\}}\right)+E\left(\exp\left(8c^{4}2^{-\frac{|k_{s}-k_{t}|}{2}}\mathscr{A}_{\mathcal{H}}\right)\mathbbm{1}_{\{(k_{s},k_{t})\in E_{2}\}}\right).

Let us examine the second term. Note that for any \ell>0 we have x^{-\ell}\log(x)\leq(e\ell)^{-1} for all x>0; indeed, the map x\mapsto x^{-\ell}\log(x) is maximized at x=e^{1/\ell}, where it takes the value (e\ell)^{-1}. Using this, we obtain

E(exp(8c42|kskt|2𝒜)𝟙{(ks,kt)E2})\displaystyle E\left(\exp\left(8c^{4}2^{-\frac{|k_{s}-k_{t}|}{2}}\mathscr{A}_{\mathcal{H}}\right)\mathbbm{1}_{\{(k_{s},k_{t})\in E_{2}\}}\right) exp(8Lc4(e|𝒱~|)η24e8Lc4𝒜)\displaystyle\leq\exp\left(8Lc^{4}\left(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|\right)^{-\frac{\eta^{2}}{4e^{8Lc^{4}}}}\mathscr{A}_{\mathcal{H}}\right)
exp(8Lc4(e|𝒱~|)η24e8Lc4log(e|𝒱~|))\displaystyle\leq\exp\left(8Lc^{4}\left(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|\right)^{-\frac{\eta^{2}}{4e^{8Lc^{4}}}}\log(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|)\right)
exp(8Lc44e8Lc41η2)\displaystyle\leq\exp\left(8Lc^{4}\cdot\frac{4e^{8Lc^{4}-1}}{\eta^{2}}\right)
exp(32Lc4η2)\displaystyle\leq\exp\left(\frac{32Lc^{4}}{\eta^{2}}\right)
2η2+1.\displaystyle\leq 2\eta^{2}+1.

The final line results from c4<cη4132Lη2log(1+2η2)c^{4}<c_{\eta}^{4}\leq\frac{1}{32L}\eta^{2}\log\left(1+2\eta^{2}\right). Note we have also used 8Lc4108Lc^{4}-1\leq 0 to obtain the penultimate line. We now examine E1E_{1}. Consider that |E1|η22e8Lc4|𝒱~|log2(e|𝒱~|)|E_{1}|\leq\frac{\eta^{2}}{2e^{8Lc^{4}}}|\widetilde{\mathscr{V}}_{\mathcal{H}}|\log_{2}(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|). Therefore,

E\left(\exp\left(8c^{4}2^{-\frac{|k_{s}-k_{t}|}{2}}\mathscr{A}_{\mathcal{H}}\right)\mathbbm{1}_{\{(k_{s},k_{t})\in E_{1}\}}\right) =\frac{1}{|\widetilde{\mathscr{V}}_{\mathcal{H}}|^{2}}\sum_{v,v^{\prime}\in\widetilde{\mathscr{V}}_{\mathcal{H}}}\exp\left(8c^{4}2^{-\frac{|k_{s}-k_{t}|}{2}}\mathscr{A}_{\mathcal{H}}\right)\mathbbm{1}_{\{(k_{s},k_{t})\in E_{1}\}}
|E1||𝒱~|2exp(8c4𝒜)\displaystyle\leq\frac{|E_{1}|}{|\widetilde{\mathscr{V}}_{\mathcal{H}}|^{2}}\exp\left(8c^{4}\mathscr{A}_{\mathcal{H}}\right)
η22e8Lc4|𝒱~|log2(e|𝒱~|)|𝒱~|2exp(8Lc4log(e|𝒱~|))\displaystyle\leq\frac{\frac{\eta^{2}}{2e^{8Lc^{4}}}|\widetilde{\mathscr{V}}_{\mathcal{H}}|\log_{2}(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|)}{|\widetilde{\mathscr{V}}_{\mathcal{H}}|^{2}}\cdot\exp\left(8Lc^{4}\log(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|)\right)
η22log2(e|𝒱~|)|𝒱~|18Lc4\displaystyle\leq\frac{\eta^{2}}{2}\cdot\frac{\log_{2}(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|)}{|\widetilde{\mathscr{V}}_{\mathcal{H}}|^{1-8Lc^{4}}}
η2+η22log2log(|𝒱~|)|𝒱~|18Lc4\displaystyle\leq\eta^{2}+\frac{\eta^{2}}{2\log 2}\cdot\frac{\log(|\widetilde{\mathscr{V}}_{\mathcal{H}}|)}{|\widetilde{\mathscr{V}}_{\mathcal{H}}|^{1-8Lc^{4}}}
η2+η22log2\displaystyle\leq\eta^{2}+\frac{\eta^{2}}{2\log 2}
2η2.\displaystyle\leq 2\eta^{2}.

Note we have used 𝒜Llog(e|𝒱~|)\mathscr{A}_{\mathcal{H}}\leq L\log(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|). We have also used 8Lc4<8Lcη4128Lc^{4}<8Lc_{\eta}^{4}\leq\frac{1}{2} along with the inequality log(x)x\log(x)\leq\sqrt{x} for all x>0x>0 to obtain the penultimate line. Therefore, we have shown

χ2(Pπ||P0)2η2+(2η2+1)1=4η2.\chi^{2}(P_{\pi}\,||\,P_{0})\leq 2\eta^{2}+\left(2\eta^{2}+1\right)-1=4\eta^{2}.

Plugging into (37) yields

infφ{P0{φ=1}+maxsp𝒜supΘ𝒯s,ΘFcψadapt(p,s,n)PΘ{φ=0}}1η.\inf_{\varphi}\left\{P_{0}\left\{\varphi=1\right\}+\max_{s\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}}\sup_{\begin{subarray}{c}\Theta\in\mathscr{T}_{s},\\ ||\Theta||_{F}\geq c\psi_{\text{adapt}}(p,s,n)\end{subarray}}P_{\Theta}\left\{\varphi=0\right\}\right\}\geq 1-\eta.

The proof is complete. ∎
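The mechanism behind the E_{1}/E_{2} splitting can also be examined numerically: for k_{s},k_{t} ranging over a dyadic grid of size |\widetilde{\mathscr{V}}_{\mathcal{H}}|, the average of \exp\left(a2^{-|k_{s}-k_{t}|/2}\right) remains bounded as the grid grows, even with amplitude a of order \log(e|\widetilde{\mathscr{V}}_{\mathcal{H}}|). The following Python sketch is illustrative only; the grid sizes and the amplitude (taken to be the full logarithm rather than the small multiple 8c^{4}\mathscr{A}_{\mathcal{H}} appearing in the proof) are arbitrary stand-ins.

# Illustration of the E_1/E_2 splitting in the proof of Theorem 2: the mean of
# exp(a * 2^{-|k - k'|/2}) over a dyadic grid of size V stays bounded as V grows.
import numpy as np

for V in [16, 64, 256, 1024]:
    a = np.log(np.e * V)                   # stand-in amplitude of order log(e|V~|)
    k = np.arange(V)
    gap = np.abs(k[:, None] - k[None, :])  # |k_s - k_t| over all pairs
    mean_val = np.exp(a * 2.0 ** (-gap / 2.0)).mean()
    print(f"V = {V:5d}:  mean of exp(a * 2^(-gap/2)) = {mean_val:.3f}")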

7.4 Adaptive upper bound

Proof of Theorem 3.

Fix η(0,1)\eta\in(0,1). In various places of the proof, we will point out that CηC_{\eta} can be taken sufficiently large to obtain desired bounds, so for now let C>CηC>C_{\eta}. Let LL^{*} denote the universal constant from Lemma 21. Let DD denote the maximum of the corresponding universal constants from Lemmas 15 and 11. Set

K2\displaystyle K_{2} :=11(log2)1/4(2c)1/4(Lc)1/4,\displaystyle:=1\vee\frac{1}{(\log 2)^{1/4}}\vee\left(\frac{2}{c}\right)^{1/4}\vee\left(\frac{L^{*}}{c}\right)^{1/4},
K3\displaystyle K_{3} :=LK22,\displaystyle:=\frac{L^{*}}{K_{2}^{2}},
K2\displaystyle K_{2}^{\prime} :=1log2(2c)1/2(cK32)1/4.\displaystyle:=\frac{1}{\sqrt{\log 2}}\vee\left(\frac{2}{c^{\prime}}\right)^{1/2}\vee(c^{\prime}K_{3}^{2})^{-1/4}.

Here, c:=c^{*}\wedge c^{**} with c^{*} and c^{**} being the universal constants in the exponential terms of Lemmas 15 and 14 respectively. Likewise, c^{\prime}:=(c^{\prime})^{*}\wedge(c^{\prime})^{**} where (c^{\prime})^{*} and (c^{\prime})^{**} are the universal constants in the exponential terms of Lemmas 11 and 10. Note that these choices of K_{2},K_{3},K_{2}^{\prime} are almost identical to the choices in the proofs of Propositions 2 and 3. The only modifications are the terms (2/c)^{1/4} and (2/c^{\prime})^{1/2}, and the utility of these modifications will become clear through the course of the proof.

For any ν\nu, define

𝒮bulk\displaystyle\mathcal{S}_{\text{bulk}} :={s𝒮:s<p𝒜 and log(1+p𝒜s2)K3dν},\displaystyle:=\left\{s\in\mathscr{S}:s<\sqrt{p\mathscr{A}_{\mathcal{H}}}\text{ and }\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}\leq K_{3}\sqrt{d_{\nu}}\right\},
𝒮tail\displaystyle\mathcal{S}_{\text{tail}} :={s𝒮:s<p𝒜 and log(1+p𝒜s2)>K3dν}.\displaystyle:=\left\{s\in\mathscr{S}:s<\sqrt{p\mathscr{A}_{\mathcal{H}}}\text{ and }\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}>K_{3}\sqrt{d_{\nu}}\right\}.

We examine the Type I and Type II errors separately. Focusing on the Type I error, union bound yields

P0{maxν𝒱maxs𝒮φν,s=1}\displaystyle P_{0}\left\{\max_{\nu\in\mathscr{V}_{\mathcal{H}}}\max_{s\in\mathscr{S}}\varphi_{\nu,s}=1\right\}
(ν𝒱s𝒮bulkP0{φν,s=1})+(ν𝒱s𝒮tailP0{φν,s=1})+(ν𝒱P0{φν,p=1}).\displaystyle\leq\left(\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}\sum_{s\in\mathcal{S}_{\text{bulk}}}P_{0}\left\{\varphi_{\nu,s}=1\right\}\right)+\left(\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}\sum_{s\in\mathcal{S}_{\text{tail}}}P_{0}\left\{\varphi_{\nu,s}=1\right\}\right)+\left(\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}P_{0}\left\{\varphi_{\nu,p}=1\right\}\right). (39)

We bound each term separately.

Type I error: Bulk

For s𝒮bulks\in\mathcal{S}_{\text{bulk}},

P0{φν,s=1}\displaystyle P_{0}\left\{\varphi_{\nu,s}=1\right\} =P0{Trν,s(dν)>Csνlog(1+p𝒜s2)}.\displaystyle=P_{0}\left\{T_{r_{\nu,s}}(d_{\nu})>Cs\sqrt{\nu\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}\right\}.

Consider that Lemma 15 gives us the following. If we select x>0x>0 satisfying

C^{*}\left(\sqrt{pr_{\nu,s}^{4}e^{-c^{*}\frac{r_{\nu,s}^{2}}{d_{\nu}}}x}+\frac{d_{\nu}}{r_{\nu,s}^{2}}x\right)\leq Cs\sqrt{\nu\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}, (40)

then we have P0{φν,s=1}exP_{0}\left\{\varphi_{\nu,s}=1\right\}\leq e^{-x}. Let us select

x=C4(C(C)2)1(2K24D)(2D/K22)(p𝒜2s2slog(1+p𝒜s2)).x=\frac{C}{4(C^{*}\vee(C^{*})^{2})}\cdot\frac{1}{(2K_{2}^{4}\lceil D\rceil)\vee(\sqrt{2\lceil D\rceil}/K_{2}^{2})}\left(\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{s^{2}}\wedge s\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right).

To verify (40), consider that cK242c^{*}K_{2}^{4}\geq 2 by our choice of K2K_{2}, and so

\sqrt{pr_{\nu,s}^{4}e^{-c^{*}\frac{r_{\nu,s}^{2}}{d_{\nu}}}x} \leq\sqrt{pK_{2}^{4}d_{\nu}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\frac{s^{4}}{(s^{2}+p\mathscr{A}_{\mathcal{H}})^{2}}x}
C2C12Dsdνlog(1+p𝒜s2)\displaystyle\leq\frac{\sqrt{C}}{2C^{*}}\frac{1}{\sqrt{2\lceil D\rceil}}s\sqrt{d_{\nu}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}
C2Csνlog(1+p𝒜s2).\displaystyle\leq\frac{\sqrt{C}}{2C^{*}}s\sqrt{\nu\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}.

Here, we have used that dν=νDd_{\nu}=\nu\vee\lceil D\rceil. We have also used C1C\geq 1, which holds provided we select CηC_{\eta} large enough (i.e. Cη1C_{\eta}\geq 1).

Likewise, consider

dνrν,s2x\displaystyle\frac{d_{\nu}}{r_{\nu,s}^{2}}x =dνK22dνlog(1+p𝒜s2)x\displaystyle=\frac{d_{\nu}}{K_{2}^{2}\sqrt{d_{\nu}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}}x
C42DCsdνlog(1+p𝒜s2)\displaystyle\leq\frac{C}{4\sqrt{2\lceil D\rceil}C^{*}}s\sqrt{d_{\nu}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}
C4Csνlog(1+p𝒜s2).\displaystyle\leq\frac{C}{4C^{*}}s\sqrt{\nu\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)}.

Therefore, (40) is satisfied, and so we have

P0{φν,s=1}exp(Cκ(p𝒜2s2slog(1+p𝒜s2)))P_{0}\left\{\varphi_{\nu,s}=1\right\}\leq\exp\left(-C\kappa\left(\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{s^{2}}\wedge s\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right)\right)

where \kappa is the universal constant \kappa=\frac{1}{4(C^{*}\vee(C^{*})^{2})}\cdot\frac{1}{(2K_{2}^{4}\lceil D\rceil)\vee(\sqrt{2\lceil D\rceil}/K_{2}^{2})}. With this bound in hand, observe

ν𝒱s𝒮bulkP0{φν,s=1}\displaystyle\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}\sum_{s\in\mathcal{S}_{\text{bulk}}}P_{0}\left\{\varphi_{\nu,s}=1\right\}
ν𝒱s𝒮bulkexp(Cκ(p𝒜2s2slog(1+p𝒜s2)))\displaystyle\leq\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}\sum_{s\in\mathcal{S}_{\text{bulk}}}\exp\left(-C\kappa\left(\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{s^{2}}\wedge s\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right)\right)
|𝒱|k{0}:2k<p𝒜exp(Cκ(p𝒜222k2klog(1+p𝒜22k)))\displaystyle\leq|\mathscr{V}_{\mathcal{H}}|\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-C\kappa\left(\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{2^{2k}}\wedge 2^{k}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{2^{2k}}\right)\right)\right)
e2𝒜k{0}:2k<p𝒜exp(Cκp𝒜222k)+e2𝒜k{0}:2k<p𝒜exp(Cκ2klog(1+p𝒜22k)).\displaystyle\leq e^{2\mathscr{A}_{\mathcal{H}}}\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-C\kappa\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{2^{2k}}\right)+e^{2\mathscr{A}_{\mathcal{H}}}\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-C\kappa 2^{k}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{2^{2k}}\right)\right).

Here we have used log(e|𝒱|)2𝒜\log(e|\mathscr{V}_{\mathcal{H}}|)\leq 2\mathscr{A}_{\mathcal{H}} from Lemma 3. We bound each term separately. First, consider

e2𝒜k{0}:2k<p𝒜exp(Cκp𝒜222k)\displaystyle e^{2\mathscr{A}_{\mathcal{H}}}\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-C\kappa\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{2^{2k}}\right) =e2𝒜k:2k<p𝒜exp(Cκ𝒜(p𝒜2k)2)\displaystyle=e^{2\mathscr{A}_{\mathcal{H}}}\sum_{\begin{subarray}{c}k\in\mathbb{N}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-C\kappa\mathscr{A}_{\mathcal{H}}\cdot\left(\frac{\sqrt{p\mathscr{A}_{\mathcal{H}}}}{2^{k}}\right)^{2}\right)
k=0exp((Cκ2)𝒜4k)\displaystyle\leq\sum_{k=0}^{\infty}\exp\left(-(C\kappa-2)\mathscr{A}_{\mathcal{H}}4^{k}\right)
η12\displaystyle\leq\frac{\eta}{12}

provided we take CηC_{\eta} sufficiently large.

Likewise, consider

e2𝒜k{0}:2k<p𝒜exp(Cκ2klog(1+p𝒜22k))\displaystyle e^{2\mathscr{A}_{\mathcal{H}}}\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-C\kappa 2^{k}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{2^{2k}}\right)\right) e2𝒜k{0}:2k<p𝒜exp(Cκ22klog(1+p𝒜22k)Cκ2𝒜)\displaystyle\leq e^{2\mathscr{A}_{\mathcal{H}}}\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-\frac{C\kappa}{2}2^{k}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{2^{2k}}\right)-\frac{C\kappa}{2}\mathscr{A}_{\mathcal{H}}\right)
e(Cκ22)𝒜k{0}:2k<p𝒜exp(Cκ22klog(1+p𝒜22k))\displaystyle\leq e^{-\left(\frac{C\kappa}{2}-2\right)\mathscr{A}_{\mathcal{H}}}\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-\frac{C\kappa}{2}2^{k}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{2^{2k}}\right)\right)
η12\displaystyle\leq\frac{\eta}{12}

again, provided we take CηC_{\eta} sufficiently large. We have used that 𝒜\mathscr{A}_{\mathcal{H}} exhibits at most logarithmic growth in pp, i.e. we use the crude bound 𝒜log(ep)\mathscr{A}_{\mathcal{H}}\leq\log(ep). Therefore, we have established

ν𝒱s𝒮bulkP0{φν,s=1}η6.\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}\sum_{s\in\mathcal{S}_{\text{bulk}}}P_{0}\left\{\varphi_{\nu,s}=1\right\}\leq\frac{\eta}{6}. (41)

Type I error: Tail

For s𝒮tails\in\mathcal{S}_{\text{tail}}, we have

P0{φν,s=1}=P0{Trs(dν)>Cslog(1+p𝒜s2)}.P_{0}\left\{\varphi_{\nu,s}=1\right\}=P_{0}\left\{T_{r_{s}^{\prime}}(d_{\nu})>Cs\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right\}.

Consider that Lemma 11 gives us an analogue to (40), namely if we select x>0x>0 satisfying

C^{**}\left(\sqrt{p(r_{s}^{\prime})^{4}e^{-c^{**}(r_{s}^{\prime})^{2}}x}+x\right)\leq Cn\tau_{\text{adapt}}^{2}(p,s,n), (42)

then we have P0{φν,s=1}exP_{0}\left\{\varphi_{\nu,s}=1\right\}\leq e^{-x}. Let us select

x=C412((C)(C)2)((K2)41)(p𝒜2s2slog(1+p𝒜s2)).x=\frac{C}{4}\cdot\frac{1}{2\left((C^{**})\vee(C^{**})^{2}\right)\left((K_{2}^{\prime})^{4}\vee 1\right)}\left(\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{s^{2}}\wedge s\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right).

To verify (42), consider that c(K2)22c^{**}(K_{2}^{\prime})^{2}\geq 2 by our choice of K2K_{2}^{\prime}, and so

p(rs)4ec(rs)2x\displaystyle\sqrt{p(r_{s}^{\prime})^{4}e^{-c^{**}(r_{s}^{\prime})^{2}}x} =(K2)2log(1+p𝒜s2)ps4(s2+p𝒜)2x\displaystyle=(K_{2}^{\prime})^{2}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\sqrt{p\frac{s^{4}}{(s^{2}+p\mathscr{A}_{\mathcal{H}})^{2}}x}
C12Clog(1+p𝒜s2)ps4(s2+p𝒜)2p𝒜2s2\displaystyle\leq\sqrt{C}\frac{1}{2C^{**}}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\sqrt{p\frac{s^{4}}{(s^{2}+p\mathscr{A}_{\mathcal{H}})^{2}}\cdot\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{s^{2}}}
C2Cslog(1+p𝒜s2).\displaystyle\leq\frac{C}{2C^{**}}s\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right).

Here, we have used C1C\geq 1. Likewise, consider that

xC2Cslog(1+p𝒜s2).x\leq\frac{C}{2C^{**}}\cdot s\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right).

Therefore, (42) is satisfied, and so we have

P0{φν,s=1}exp(Cκ(p𝒜2s2slog(1+p𝒜s2)))P_{0}\left\{\varphi_{\nu,s}=1\right\}\leq\exp\left(-C\kappa\left(\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{s^{2}}\wedge s\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right)\right)

where \kappa is the universal constant \kappa=\frac{1}{4}\cdot\frac{1}{2\left((C^{**})\vee(C^{**})^{2}\right)\left((K_{2}^{\prime})^{4}\vee 1\right)}. With this bound in hand, observe

ν𝒱s𝒮tailP0{φν,s=1}\displaystyle\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}\sum_{s\in\mathcal{S}_{\text{tail}}}P_{0}\left\{\varphi_{\nu,s}=1\right\}
ν𝒱s𝒮tailexp(Cκ(p𝒜2s2slog(1+p𝒜s2)))\displaystyle\leq\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}\sum_{s\in\mathcal{S}_{\text{tail}}}\exp\left(-C\kappa\left(\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{s^{2}}\wedge s\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{s^{2}}\right)\right)\right)
|𝒱|k{0}:2k<p𝒜exp(Cκ(p𝒜222k2klog(1+p𝒜22k)))\displaystyle\leq|\mathscr{V}_{\mathcal{H}}|\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-C\kappa\left(\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{2^{2k}}\wedge 2^{k}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{2^{2k}}\right)\right)\right)
e2𝒜k{0}:2k<p𝒜exp(Cκp𝒜222k)+e2𝒜k{0}:2k<p𝒜exp(Cκ2klog(1+p𝒜22k)).\displaystyle\leq e^{2\mathscr{A}_{\mathcal{H}}}\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-C\kappa\frac{p\mathscr{A}_{\mathcal{H}}^{2}}{2^{2k}}\right)+e^{2\mathscr{A}_{\mathcal{H}}}\sum_{\begin{subarray}{c}k\in\mathbb{N}\cup\{0\}:\\ 2^{k}<\sqrt{p\mathscr{A}_{\mathcal{H}}}\end{subarray}}\exp\left(-C\kappa 2^{k}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{2^{2k}}\right)\right).

Here we have used log(e|𝒱|)2𝒜\log(e|\mathscr{V}_{\mathcal{H}}|)\leq 2\mathscr{A}_{\mathcal{H}} from Lemma 3. From here, the same argument from the bulk case can be employed to conclude

ν𝒱s𝒮tailP0{φν,s=1}η6\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}\sum_{s\in\mathcal{S}_{\text{tail}}}P_{0}\left\{\varphi_{\nu,s}=1\right\}\leq\frac{\eta}{6} (43)

provided CηC_{\eta} is taken sufficiently large.

Type I error: Dense

Let us now bound ν𝒱P0{φν,p=1}\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}P_{0}\left\{\varphi_{\nu,p}=1\right\}. Consider that for any ν𝒱\nu\in\mathscr{V}_{\mathcal{H}}, we have

P_{0}\left\{\varphi_{\nu,p}=1\right\} =P_{0}\left\{n\sum_{j=1}^{p}\sum_{k\leq\nu}X_{k,j}^{2}>\nu p+C\sqrt{\nu p\mathscr{A}_{\mathcal{H}}}\right\}
=P\left\{\chi^{2}_{\nu p}>\nu p+C\sqrt{\nu p\mathscr{A}_{\mathcal{H}}}\right\}
\leq P\left\{\chi^{2}_{\nu p}>\nu p+\frac{C}{2}\left(\sqrt{\nu p\mathscr{A}_{\mathcal{H}}}+\mathscr{A}_{\mathcal{H}}\right)\right\}
\leq P\left\{\chi^{2}_{\nu p}>\nu p+2\sqrt{\nu p\cdot\frac{C}{4}\mathscr{A}_{\mathcal{H}}}+2\cdot\frac{C}{4}\mathscr{A}_{\mathcal{H}}\right\}
\leq e^{-\frac{C}{4}\mathscr{A}_{\mathcal{H}}}.

To obtain the third line, we have used \mathscr{A}_{\mathcal{H}}\leq\log(ep), which implies \sqrt{\nu p\mathscr{A}_{\mathcal{H}}}\geq\mathscr{A}_{\mathcal{H}}. In the above display, we have also used \frac{C}{4}\geq 1 (which holds provided we take C_{\eta} large enough) to obtain the penultimate line and Lemma 16 to obtain the final line. Consequently,

ν𝒱P0{φν,p=1}|𝒱|eC4𝒜e(C42)𝒜η6\sum_{\nu\in\mathscr{V}_{\mathcal{H}}}P_{0}\left\{\varphi_{\nu,p}=1\right\}\leq|\mathscr{V}_{\mathcal{H}}|e^{-\frac{C}{4}\mathscr{A}_{\mathcal{H}}}\leq e^{-\left(\frac{C}{4}-2\right)\mathscr{A}_{\mathcal{H}}}\leq\frac{\eta}{6} (44)

since log(e|𝒱|)2𝒜\log(e|\mathscr{V}_{\mathcal{H}}|)\leq 2\mathscr{A}_{\mathcal{H}} by Lemma 3 and C42log(6/η)\frac{C}{4}-2\geq\log(6/\eta), which holds provided we take CηC_{\eta} sufficiently large.

Putting together (41), (43), and (44) into (39) yields the Type I error bound

P0{maxν𝒱maxs𝒮φν,s=1}η6+η6+η6=η2.P_{0}\left\{\max_{\nu\in\mathscr{V}_{\mathcal{H}}}\max_{s\in\mathscr{S}}\varphi_{\nu,s}=1\right\}\leq\frac{\eta}{6}+\frac{\eta}{6}+\frac{\eta}{6}=\frac{\eta}{2}. (45)
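The chi-square deviation inequality invoked through Lemma 16 can likewise be checked by simulation. The following Python sketch assumes Lemma 16 takes the standard Laurent-Massart form P\left\{\chi^{2}_{d}\geq d+2\sqrt{dx}+2x\right\}\leq e^{-x}; the values of d and x below are arbitrary.

# Monte Carlo check of the chi^2 upper-tail bound used in the Type I analysis,
# assuming Lemma 16 is the Laurent-Massart form. d and x are illustrative.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
d = 2000
samples = chi2.rvs(df=d, size=500_000, random_state=rng)
for x in [1.0, 3.0, 6.0]:
    emp = (samples >= d + 2 * np.sqrt(d * x) + 2 * x).mean()
    print(f"x = {x}: empirical tail = {emp:.6f} <= exp(-x) = {np.exp(-x):.6f}")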

Type II error: We now examine the Type II error. Suppose s[p]s^{*}\in[p]. Let fsf\in\mathcal{F}_{s^{*}} with f2Cτadapt(p,s,n)||f||_{2}\geq C\tau_{\text{adapt}}(p,s^{*},n). We proceed by considering various cases.

Type II error: Dense

Suppose sp𝒜s^{*}\geq\sqrt{p\mathscr{A}_{\mathcal{H}}}. Let ν~\tilde{\nu} denote the smallest element in 𝒱\mathscr{V}_{\mathcal{H}} which is greater than or equal to ν(s,𝒜)\nu_{\mathcal{H}}(s^{*},\mathscr{A}_{\mathcal{H}}). Then

P_{f}\left\{\max_{\nu\in\mathscr{V}_{\mathcal{H}}}\max_{s\in\mathscr{S}}\varphi_{\nu,s}=0\right\} \leq P_{f}\left\{\varphi_{\tilde{\nu},p}=0\right\}=P_{f}\left\{n\sum_{j=1}^{p}\sum_{k\leq\tilde{\nu}}X_{k,j}^{2}\leq\tilde{\nu}p+C\sqrt{\tilde{\nu}p\mathscr{A}_{\mathcal{H}}}\right\}.

Consider that

nj=1pkν~Xk,j2χpν~2(nj=1pkν~θk,j2)n\sum_{j=1}^{p}\sum_{k\leq\tilde{\nu}}X_{k,j}^{2}\sim\chi^{2}_{p\tilde{\nu}}\left(n\sum_{j=1}^{p}\sum_{k\leq\tilde{\nu}}\theta_{k,j}^{2}\right)

where the collection {θk,j}\{\theta_{k,j}\} denotes the collection of basis coefficients for ff. We will also use the matrix Θ\Theta to denote the collection of basis coefficients; note that Θ𝒯s\Theta\in\mathscr{T}_{s^{*}}. Let SS^{*} denote the set of jj for which Θj\Theta_{j} are nonzero. Observe

C2τadapt2(p,s,n)\displaystyle C^{2}\tau_{\text{adapt}}^{2}(p,s^{*},n) f22\displaystyle\leq||f||_{2}^{2}
=j=1pk=1θk,j2\displaystyle=\sum_{j=1}^{p}\sum_{k=1}^{\infty}\theta_{k,j}^{2}
jS(kν~θk,j2+k>ν~θk,j2)\displaystyle\leq\sum_{j\in S^{*}}\left(\sum_{k\leq\tilde{\nu}}\theta_{k,j}^{2}+\sum_{k>\tilde{\nu}}\theta_{k,j}^{2}\right)
jS(kν~θk,j2+μν~k>ν~θk,j2μk)\displaystyle\leq\sum_{j\in S^{*}}\left(\sum_{k\leq\tilde{\nu}}\theta_{k,j}^{2}+\mu_{\tilde{\nu}}\sum_{k>\tilde{\nu}}\frac{\theta_{k,j}^{2}}{\mu_{k}}\right)
(jSkν~θk,j2)+sμν~\displaystyle\leq\left(\sum_{j\in S^{*}}\sum_{k\leq\tilde{\nu}}\theta_{k,j}^{2}\right)+s^{*}\mu_{\tilde{\nu}}
=(j=1pkν~θk,j2)+sμν~.\displaystyle=\left(\sum_{j=1}^{p}\sum_{k\leq\tilde{\nu}}\theta_{k,j}^{2}\right)+s^{*}\mu_{\tilde{\nu}}.

Therefore,

nj=1pkν~θk,j2\displaystyle n\sum_{j=1}^{p}\sum_{k\leq\tilde{\nu}}\theta_{k,j}^{2} C2nτadapt2(p,s,n)nsμν~\displaystyle\geq C^{2}n\tau_{\text{adapt}}^{2}(p,s^{*},n)-ns^{*}\mu_{\tilde{\nu}}
C2nτadapt2(p,s,n)sν~log(1+p𝒜(s)2)\displaystyle\geq C^{2}n\tau_{\text{adapt}}^{2}(p,s^{*},n)-s^{*}\sqrt{\tilde{\nu}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{(s^{*})^{2}}\right)}
C2nτadapt2(p,s,n)2nτadapt2(p,s,n)\displaystyle\geq C^{2}n\tau_{\text{adapt}}^{2}(p,s^{*},n)-\sqrt{2}n\tau_{\text{adapt}}^{2}(p,s^{*},n)
\geq(C^{2}-\sqrt{2})n\tau_{\text{adapt}}^{2}(p,s^{*},n).

We have used that ν~2ν(s,𝒜)\tilde{\nu}\leq 2\nu_{\mathcal{H}}(s^{*},\mathscr{A}_{\mathcal{H}}) to obtain the third line. Therefore, we have

nj=1pkν~θk,j2(C22)nτadapt2(p,s,n)C222ν~p𝒜.n\sum_{j=1}^{p}\sum_{k\leq\tilde{\nu}}\theta_{k,j}^{2}\geq\left(C^{2}-\sqrt{2}\right)n\tau_{\text{adapt}}^{2}(p,s^{*},n)\geq\frac{C^{2}-\sqrt{2}}{\sqrt{2}}\sqrt{\tilde{\nu}p\mathscr{A}_{\mathcal{H}}}.

The signal is thus strong enough to be detected, as can be seen by following the argument of Proposition 4. Hence, we can achieve Type II error at most \frac{\eta}{2} by selecting C_{\eta} sufficiently large. Combining this bound with (45), we conclude that the total testing risk is bounded above by \eta, as desired.

Type II error: Bulk

Suppose s<p𝒜s^{*}<\sqrt{p\mathscr{A}_{\mathcal{H}}} and log(1+p𝒜(s)2)K3ν(s,𝒜)D\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{(s^{*})^{2}}\right)}\leq K_{3}\sqrt{\nu_{\mathcal{H}}(s^{*},\mathscr{A}_{\mathcal{H}})\vee\lceil D\rceil}. Let ν~\tilde{\nu} be the smallest element in 𝒱\mathscr{V}_{\mathcal{H}} greater than or equal to ν(s,𝒜)\nu_{\mathcal{H}}(s^{*},\mathscr{A}_{\mathcal{H}}). Let s~\tilde{s} be the smallest element in 𝒮\mathscr{S} greater than or equal to ss^{*}. By the definitions of these grids, we have ν~/2<ν(s,𝒜)ν~\tilde{\nu}/2<\nu_{\mathcal{H}}(s^{*},\mathscr{A}_{\mathcal{H}})\leq\tilde{\nu} and ss~2ss^{*}\leq\tilde{s}\leq 2s^{*}. Then

Pf{maxν𝒱maxs𝒮φν,s=0}Pf{φν~,s~=0}.P_{f}\left\{\max_{\nu\in\mathscr{V}_{\mathcal{H}}}\max_{s\in\mathscr{S}}\varphi_{\nu,s}=0\right\}\leq P_{f}\left\{\varphi_{\tilde{\nu},\tilde{s}}=0\right\}.

With these choices, we have

log(1+p𝒜s~2)K3dν~.\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{\tilde{s}^{2}}\right)}\leq K_{3}\sqrt{d_{\tilde{\nu}}}.

Note that since \tilde{s}\geq s^{*}, we have f\in\mathcal{F}_{\tilde{s}}. Following arguments similar to those in the proof of Proposition 2, it can be seen that the necessary signal strength to successfully detect is of squared order

s~ν~log(1+p𝒜s~2)nsν(s,𝒜)log(1+p𝒜(s)2)n.\frac{\tilde{s}\sqrt{\tilde{\nu}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{\tilde{s}^{2}}\right)}}{n}\asymp\frac{s^{*}\sqrt{\nu_{\mathcal{H}}(s^{*},\mathscr{A}_{\mathcal{H}})\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{(s^{*})^{2}}\right)}}{n}.

This is precisely the order of \tau^{2}_{\text{adapt}}(p,s^{*},n), and so we can obtain Type II error less than \frac{\eta}{2} by choosing C_{\eta} sufficiently large. Combining this bound with (45), we conclude that the total testing risk is bounded above by \eta, as desired.

Type II error: Tail

Suppose s<p𝒜s^{*}<\sqrt{p\mathscr{A}_{\mathcal{H}}} and log(1+p𝒜(s)2)>K3ν(s,𝒜)D\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{(s^{*})^{2}}\right)}>K_{3}\sqrt{\nu_{\mathcal{H}}(s^{*},\mathscr{A}_{\mathcal{H}})\vee\lceil D\rceil}. Let ν~\tilde{\nu} and s~\tilde{s} be as defined in the previous case, i.e. Type II error analysis for the bulk case. As before, we have

Pf{maxν𝒱maxs𝒮φν,s=0}Pf{φν~,s~=0}.P_{f}\left\{\max_{\nu\in\mathscr{V}_{\mathcal{H}}}\max_{s\in\mathscr{S}}\varphi_{\nu,s}=0\right\}\leq P_{f}\left\{\varphi_{\tilde{\nu},\tilde{s}}=0\right\}.

Now, consider that ss~s^{*}\leq\tilde{s} and so fs~f\in\mathcal{F}_{\tilde{s}}.

Suppose we have

log(1+p𝒜s~2)>K3dν~.\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{\tilde{s}^{2}}\right)}>K_{3}\sqrt{d_{\tilde{\nu}}}.

We can follow arguments similar to those in the proof of Proposition 3 to see that the necessary signal strength to successfully detect is of squared order

s~log(1+p𝒜s~2)nslog(1+p𝒜(s)2)n.\frac{\tilde{s}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{\tilde{s}^{2}}\right)}{n}\asymp\frac{s^{*}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{(s^{*})^{2}}\right)}{n}.

This is precisely the order of τadapt2(p,s,n)\tau^{2}_{\text{adapt}}(p,s^{*},n) and so we can obtain Type II error less than η2\frac{\eta}{2} by choosing CηC_{\eta} sufficiently large.

On the other hand, suppose we have

\sqrt{\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{\tilde{s}^{2}}\right)}\leq K_{3}\sqrt{d_{\tilde{\nu}}}.

As argued in the bulk case above, the necessary signal strength to successfully detect is of squared order

s~ν~log(1+p𝒜s~2)nsν(s,𝒜)log(1+p𝒜(s)2)n.\frac{\tilde{s}\sqrt{\tilde{\nu}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{\tilde{s}^{2}}\right)}}{n}\asymp\frac{s^{*}\sqrt{\nu_{\mathcal{H}}(s^{*},\mathscr{A}_{\mathcal{H}})\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{(s^{*})^{2}}\right)}}{n}.

Consider that

τadapt2(p,s,n)=slog(1+p𝒜(s)2)nsν(s,𝒜)log(1+p𝒜(s)2)n\displaystyle\tau^{2}_{\text{adapt}}(p,s^{*},n)=\frac{s^{*}\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{(s^{*})^{2}}\right)}{n}\gtrsim\frac{s^{*}\sqrt{\nu_{\mathcal{H}}(s^{*},\mathscr{A}_{\mathcal{H}})\log\left(1+\frac{p\mathscr{A}_{\mathcal{H}}}{(s^{*})^{2}}\right)}}{n}

and so we can obtain Type II error less than \frac{\eta}{2} by choosing C_{\eta} sufficiently large. Combining this bound with (45), we conclude that the total testing risk is bounded above by \eta, as desired. ∎

7.5 Smoothness and sparsity adaptive lower bounds

Proof of Theorem 5.

Fix η(0,1)\eta\in(0,1). Fix any sp1/2+δloglogns\geq p^{1/2+\delta}\sqrt{\log\log n}. Through the course of the proof, we will note where we must take cηc_{\eta} suitably small enough, so for now let 0<c<cη0<c<c_{\eta}. For each ss, define the geometric grid

𝒱s:={2k:k and (nsploglog(np))24α1+12k(nsploglog(np))24α0+1}.\mathcal{V}_{s}:=\left\{2^{k}:k\in\mathbb{N}\text{ and }\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{\frac{2}{4\alpha_{1}+1}}\leq 2^{k}\leq\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{\frac{2}{4\alpha_{0}+1}}\right\}.

Note log|𝒱s|loglog(np)\log|\mathcal{V}_{s}|\asymp\log\log(np).

We now define a prior π\pi which is supported on the alternative hypothesis. Let νUniform(𝒱s)\nu\sim\text{Uniform}(\mathcal{V}_{s}). Let α=α(ν,s)\alpha=\alpha(\nu,s) denote the solution to

ν=(nsploglog(np))24α+1.\nu=\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{\frac{2}{4\alpha+1}}.
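Solving explicitly for \alpha (with M:=\frac{ns}{\sqrt{p\log\log(np)}} introduced here purely as shorthand), taking logarithms gives

\log\nu=\frac{2\log M}{4\alpha+1},\qquad\text{and hence}\qquad\alpha(\nu,s)=\frac{1}{4}\left(\frac{2\log M}{\log\nu}-1\right).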

Note that α(ν,s)[α0,α1]\alpha(\nu,s)\in[\alpha_{0},\alpha_{1}] for any ν𝒱s\nu\in\mathcal{V}_{s}. Note α\alpha is random since ν\nu is random. Draw uniformly at random a subset S[p]S\subset[p] of size ss. Draw independently

θk,j{Uniform{cρν,cρν}if jS and kν,δ0otherwise.\theta_{k,j}\sim\begin{cases}\operatorname{Uniform}\left\{-c\rho_{\nu},c\rho_{\nu}\right\}&\textit{if }j\in S\text{ and }k\leq\nu,\\ \delta_{0}&\textit{otherwise}.\end{cases}

Here, ρν\rho_{\nu} is given by ρν:=(nsploglog(np))2α+14α+1\rho_{\nu}:=\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{2\alpha+1}{4\alpha+1}}. Having defined ρν\rho_{\nu}, the definition of the prior π\pi is complete.

Now we must show π\pi is indeed supported on the alternative hypothesis. Consider that for Θπ\Theta\sim\pi, we have

ΘF2=c2sρν2ν=c2s(nsploglog(np))4α4α+1=c2τdense2(p,s,n,α).\displaystyle||\Theta||_{F}^{2}=c^{2}s\rho_{\nu}^{2}\nu=c^{2}s\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{4\alpha}{4\alpha+1}}=c^{2}\tau_{\text{dense}}^{2}(p,s,n,\alpha).
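The exponent arithmetic behind the second equality, which will be used again below, is

\rho_{\nu}^{2}\nu=\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{4\alpha+2}{4\alpha+1}+\frac{2}{4\alpha+1}}=\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{4\alpha}{4\alpha+1}},\qquad\rho_{\nu}^{4}\nu=\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{8\alpha+4}{4\alpha+1}+\frac{2}{4\alpha+1}}=\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-2}.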

Furthermore, consider that for iSi\in S we have by definition of α=α(ν,s)\alpha=\alpha(\nu,s),

=1θi,22α=c2ρν2ν2αν2αν2αc2ρν2ν2α+1=c2ρν2(nsploglog(np))4α+24α+1c21.\sum_{\ell=1}^{\infty}\frac{\theta_{i,\ell}^{2}}{\ell^{-2\alpha}}=c^{2}\rho_{\nu}^{2}\nu^{2\alpha}\sum_{\ell\leq\nu}\frac{\ell^{2\alpha}}{\nu^{2\alpha}}\leq c^{2}\rho_{\nu}^{2}\nu^{2\alpha+1}=c^{2}\rho_{\nu}^{2}\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{\frac{4\alpha+2}{4\alpha+1}}\leq c^{2}\leq 1.

We can take cη1c_{\eta}\leq 1 to ensure c21c^{2}\leq 1. Of course, for iSci\in S^{c} we have Θi=0\Theta_{i}=0. Hence, we have shown Θ𝒯(s,α)\Theta\in\mathscr{T}(s,\alpha) with probability one, and so π\pi has the proper support.

Writing Pπ=PΘπ(dΘ)P_{\pi}=\int P_{\Theta}\pi(d\Theta) to denote the mixture induced by the prior π\pi, we have

infφ{P0{φ0}+maxs~p1/2+δsupα[α0,α1]supf(s~,α),f2cτdense(p,s~,n,α)Pf{φ1}}\displaystyle\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\max_{\tilde{s}\geq p^{1/2+\delta}}\sup_{\alpha\in[\alpha_{0},\alpha_{1}]}\sup_{\begin{subarray}{c}f\in\mathcal{F}(\tilde{s},\alpha),\\ ||f||_{2}\geq c\tau_{\text{dense}}(p,\tilde{s},n,\alpha)\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\} 112χ2(Pπ||P0).\displaystyle\geq 1-\frac{1}{2}\sqrt{\chi^{2}(P_{\pi}\,||\,P_{0})}.

For the following calculations, let Θ,Θiidπ\Theta,\Theta^{\prime}\overset{iid}{\sim}\pi. Let ν,ν𝒱s\nu,\nu^{\prime}\in\mathcal{V}_{s} be the corresponding quantities and let α,α\alpha,\alpha^{\prime} denote the corresponding smoothness levels. Further let S,SS,S^{\prime} denote the corresponding support sets. Note both are of size ss. Let {ri,ri}1ip,\{r_{i\ell},r^{\prime}_{i\ell}\}_{1\leq i\leq p,\ell\in\mathbb{N}} denote an iid collection of Rademacher(1/2)\operatorname{Rademacher}(1/2) random variables which is independent of all the other random variables. By the Ingster-Suslina method (Proposition 5), we have

χ2(Pπ||P0)+1\displaystyle\chi^{2}(P_{\pi}\,||\,P_{0})+1 =E(exp(nΘ,ΘF))\displaystyle=E\left(\exp\left(n\langle\Theta,\Theta^{\prime}\rangle_{F}\right)\right)
=E(exp(c2nρνρνiSS=1ννriri))\displaystyle=E\left(\exp\left(c^{2}n\rho_{\nu}\rho_{\nu^{\prime}}\sum_{i\in S\cap S^{\prime}}\sum_{\ell=1}^{\nu\wedge\nu^{\prime}}r_{i\ell}r_{i\ell}^{\prime}\right)\right)
E(exp(c42n2ρν2ρν2(νν)|SS|))\displaystyle\leq E\left(\exp\left(\frac{c^{4}}{2}n^{2}\rho_{\nu}^{2}\rho_{\nu^{\prime}}^{2}(\nu\wedge\nu^{\prime})|S\cap S^{\prime}|\right)\right)
E(exp(c42n2ρν4νρν4ννννν|SS|)).\displaystyle\leq E\left(\exp\left(\frac{c^{4}}{2}n^{2}\sqrt{\rho_{\nu}^{4}\nu\cdot\rho_{\nu^{\prime}}^{4}\nu^{\prime}}\cdot\frac{\nu\wedge\nu^{\prime}}{\sqrt{\nu\nu^{\prime}}}|S\cap S^{\prime}|\right)\right).

From the definition of α(ν,s)\alpha(\nu,s), we have

ρν4νρν4ν\displaystyle\sqrt{\rho_{\nu}^{4}\nu\cdot\rho_{\nu^{\prime}}^{4}\nu^{\prime}} =((nsploglog(np))8α+44α+1+24α+1(nsploglog(np))8α+44α+1+24α+1)1/2\displaystyle=\left(\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{8\alpha+4}{4\alpha+1}+\frac{2}{4\alpha+1}}\left(\frac{ns}{\sqrt{p\log\log(np)}}\right)^{-\frac{8\alpha^{\prime}+4}{4\alpha^{\prime}+1}+\frac{2}{4\alpha^{\prime}+1}}\right)^{1/2}
=(ploglog(np)ns)2\displaystyle=\left(\frac{\sqrt{p\log\log(np)}}{ns}\right)^{2}
2log(1+ploglog(np)s2)n2.\displaystyle\leq\frac{2\log\left(1+\frac{p\log\log(np)}{s^{2}}\right)}{n^{2}}.

Here, we have used sploglog(np)s\geq\sqrt{p\log\log(np)} since sp1/2+δs\geq p^{1/2+\delta} and sploglogns\geq\sqrt{p\log\log n}. We have also used the inequality x/2log(1+x)x/2\leq\log(1+x) for 0x10\leq x\leq 1. With this in hand, it follows that

χ2(Pπ||P0)+1\displaystyle\chi^{2}(P_{\pi}\,||\,P_{0})+1 E(exp(c4log(1+ploglog(np)s2)νννν|SS|))\displaystyle\leq E\left(\exp\left(c^{4}\cdot\log\left(1+\frac{p\log\log(np)}{s^{2}}\right)\cdot\frac{\nu\wedge\nu^{\prime}}{\sqrt{\nu\nu^{\prime}}}|S\cap S^{\prime}|\right)\right)
E((1sp+spec4(νννν)log(1+ploglog(np)s2))s)\displaystyle\leq E\left(\left(1-\frac{s}{p}+\frac{s}{p}e^{c^{4}\left(\frac{\nu\wedge\nu^{\prime}}{\sqrt{\nu\nu^{\prime}}}\right)\log\left(1+\frac{p\log\log(np)}{s^{2}}\right)}\right)^{s}\right)
E(exp(c4ννννloglog(np))).\displaystyle\leq E\left(\exp\left(c^{4}\frac{\nu\wedge\nu^{\prime}}{\sqrt{\nu\nu^{\prime}}}\log\log(np)\right)\right).

Noting that we can write ν=2k\nu=2^{k}, ν=2k\nu^{\prime}=2^{k^{\prime}} and that |𝒱s|log(np)|\mathcal{V}_{s}|\asymp\log(np), we can follow the same steps as in the proof of Proposition 4.2 in [16]. Taking cηc_{\eta} suitably small, we obtain χ2(Pπ||P0)4η2\chi^{2}(P_{\pi}\,||\,P_{0})\leq 4\eta^{2} which yields

\inf_{\varphi}\left\{P_{0}\left\{\varphi\neq 0\right\}+\max_{\tilde{s}\geq p^{1/2+\delta}}\sup_{\alpha\in[\alpha_{0},\alpha_{1}]}\sup_{\begin{subarray}{c}f\in\mathcal{F}(\tilde{s},\alpha),\\ ||f||_{2}\geq c\tau_{\text{dense}}(p,\tilde{s},n,\alpha)\end{subarray}}P_{f}\left\{\varphi\neq 1\right\}\right\}\geq 1-\eta

as desired. ∎

Proof of Theorem 6.

Some simplification is convenient. Note τsparse\tau_{\text{sparse}} can be rewritten as

τsparse2(p,s,n,α){s(nlog(ploglogn))4α4α+1if log(ploglogn)n12α+1,slog(ploglogn)nif log(ploglogn)>n12α+1.\tau^{2}_{\text{sparse}}(p,s,n,\alpha)\asymp\begin{cases}s\left(\frac{n}{\sqrt{\log(p\log\log n)}}\right)^{-\frac{4\alpha}{4\alpha+1}}&\textit{if }\log(p\log\log n)\leq n^{\frac{1}{2\alpha+1}},\\ \frac{s\log(p\log\log n)}{n}&\textit{if }\log(p\log\log n)>n^{\frac{1}{2\alpha+1}}.\end{cases}

Note that in the case log(ploglogn)>n12α+1\log(p\log\log n)>n^{\frac{1}{2\alpha+1}}, we have logpn12α+1\log p\gtrsim n^{\frac{1}{2\alpha+1}}. Therefore, log(ploglogn)logp\log(p\log\log n)\asymp\log p and so we can further simplify

τsparse2(p,s,n,α){s(nlog(ploglogn))4α4α+1if log(ploglogn)n12α+1,slogpnif log(ploglogn)>n12α+1.\tau^{2}_{\text{sparse}}(p,s,n,\alpha)\asymp\begin{cases}s\left(\frac{n}{\sqrt{\log(p\log\log n)}}\right)^{-\frac{4\alpha}{4\alpha+1}}&\textit{if }\log(p\log\log n)\leq n^{\frac{1}{2\alpha+1}},\\ \frac{s\log p}{n}&\textit{if }\log(p\log\log n)>n^{\frac{1}{2\alpha+1}}.\end{cases} (46)

Writing τsparse\tau_{\text{sparse}} in the form (46) is convenient. The lower bound slogpn\frac{s\log p}{n} is exactly the minimax lower bound and so no new argument is needed. All that needs to be proved is the lower bound s(n/log(ploglogn))4α4α+1s\left(n/\sqrt{\log(p\log\log n)}\right)^{-\frac{4\alpha}{4\alpha+1}}.

The proof is very similar to that of Theorem 5, so we only point out the modifications in the interest of brevity. Fix any s<p^{1/2-\delta}. Let \pi be the prior from the proof of Theorem 5, but use \mathcal{V}_{\text{sparse}} given by (47) and \alpha(\nu) given by (48), defined below, instead. Define the geometric grid

𝒱sparse:={2k:k and (nlog(ploglogn))24α1+12k(nlog(ploglogn))24α0+1}\mathcal{V}_{\text{sparse}}:=\left\{2^{k}:k\in\mathbb{N}\text{ and }\left(\frac{n}{\sqrt{\log(p\log\log n)}}\right)^{\frac{2}{4\alpha_{1}+1}}\leq 2^{k}\leq\left(\frac{n}{\sqrt{\log(p\log\log n)}}\right)^{\frac{2}{4\alpha_{0}+1}}\right\} (47)

Let α(ν)\alpha(\nu) denote the solution to

ν=(nlog(ploglogn))24α+1.\nu=\left(\frac{n}{\sqrt{\log(p\log\log n)}}\right)^{\frac{2}{4\alpha+1}}. (48)

Note α(ν)[α0,α1]\alpha(\nu)\in[\alpha_{0},\alpha_{1}] for any ν𝒱sparse\nu\in\mathcal{V}_{\text{sparse}}. Also, use

ρν=(nlog(ploglogn))2α+14α+1.\rho_{\nu}=\left(\frac{n}{\sqrt{\log(p\log\log n)}}\right)^{-\frac{2\alpha+1}{4\alpha+1}}.

It can be checked in the same manner that π\pi with these modifications is properly supported on the alternative hypothesis. We can continue along as in the proof of Theorem 5 up until the point we have

ρν4νρν4ν=(log(ploglogn)n)2.\sqrt{\rho_{\nu}^{4}\nu\cdot\rho_{\nu^{\prime}}^{4}\nu^{\prime}}=\left(\frac{\sqrt{\log(p\log\log n)}}{n}\right)^{2}.

From here, we use the fact that sp12δs\leq p^{\frac{1}{2}-\delta} implies log(ploglogn)log(1+ploglogns2)\log(p\log\log n)\asymp\log\left(1+\frac{p\log\log n}{s^{2}}\right). In other words, there exists a constant κ>0\kappa>0 such that

ρν4νρν4νκlog(1+ploglogns2)n2.\sqrt{\rho_{\nu}^{4}\nu\cdot\rho_{\nu^{\prime}}^{4}\nu^{\prime}}\leq\kappa\frac{\log\left(1+\frac{p\log\log n}{s^{2}}\right)}{n^{2}}.

Then by taking cηc_{\eta} sufficiently small, the rest of the proof of Theorem 5 can be carried out to obtain the desired result. ∎

7.6 Smoothness and sparsity adaptive upper bounds

Proof of Theorem 4.

Fix \eta\in(0,1). For ease, let us just write \mathcal{V}=\mathcal{V}_{\text{test}}. We will note throughout the proof where we take C_{\eta} suitably large, so for now let C>C_{\eta}. We will also note where we take K_{\eta} suitably large. We first examine the Type I error. By union bound and taking K_{\eta}\geq 1,

P0{maxν𝒱φν=1}\displaystyle P_{0}\left\{\max_{\nu\in\mathcal{V}}\varphi_{\nu}=1\right\} ν𝒱P0{φν=1}\displaystyle\leq\sum_{\nu\in\mathcal{V}}P_{0}\left\{\varphi_{\nu}=1\right\}
=ν𝒱P{χνp2νp+Kη(νploglog(np)+loglog(np))}\displaystyle=\sum_{\nu\in\mathcal{V}}P\left\{\chi^{2}_{\nu p}\geq\nu p+K_{\eta}\left(\sqrt{\nu p\log\log(np)}+\log\log(np)\right)\right\}
ν𝒱P{χνp2νp+2(νpKη4loglog(np)+Kη4loglog(np))}\displaystyle\leq\sum_{\nu\in\mathcal{V}}P\left\{\chi^{2}_{\nu p}\geq\nu p+2\left(\sqrt{\nu p\frac{K_{\eta}}{4}\log\log(np)}+\frac{K_{\eta}}{4}\log\log(np)\right)\right\}
|𝒱|eKη4loglog(np)\displaystyle\leq|\mathcal{V}|e^{-\frac{K_{\eta}}{4}\log\log(np)}
e(Kη4κ)loglog(np)\displaystyle\leq e^{-\left(\frac{K_{\eta}}{4}-\kappa\right)\log\log(np)}

for some universal positive constant \kappa. Here, we have used Lemma 16 together with |\mathcal{V}|\asymp\log(np). Taking K_{\eta} suitably large, the Type I error is suitably bounded

P0{maxν𝒱φν=1}η2.P_{0}\left\{\max_{\nu\in\mathcal{V}}\varphi_{\nu}=1\right\}\leq\frac{\eta}{2}.

We now examine the Type II error. Fix any sp1/2+δs^{*}\geq p^{1/2+\delta}, α[α0,α1]\alpha^{*}\in[\alpha_{0},\alpha_{1}], and f(s,α)f^{*}\in\mathcal{F}(s^{*},\alpha^{*}) with f2Cτdense(p,s,n,α)||f^{*}||_{2}\geq C\tau_{\text{dense}}(p,s^{*},n,\alpha^{*}). Set

ν=(nsploglog(np))24α+1.\nu^{*}=\left(\frac{ns^{*}}{\sqrt{p\log\log(np)}}\right)^{\frac{2}{4\alpha^{*}+1}}.

Let ν𝒱\nu\in\mathcal{V} be the smallest element larger than ν\nu^{*}. Note ν/2νν\nu/2\leq\nu^{*}\leq\nu by definition of 𝒱\mathcal{V}.

Pf{maxν~𝒱φν~=0}\displaystyle P_{f^{*}}\left\{\max_{\tilde{\nu}\in\mathcal{V}}\varphi_{\tilde{\nu}}=0\right\}
Pf{φν=0}\displaystyle\leq P_{f^{*}}\left\{\varphi_{\nu}=0\right\}
\displaystyle=P_{f^{*}}\left\{\chi^{2}_{\nu p}(n||f^{*}||_{2}^{2})\leq\nu p+K_{\eta}\left(\sqrt{p\nu\log\log(np)}+\log\log(np)\right)\right\}.

Since \nu\geq\left(\frac{ns^{*}}{\sqrt{p\log\log(np)}}\right)^{\frac{2}{4\alpha^{*}+1}} and s^{*}\geq p^{1/2+\delta}, the quantity \nu grows polynomially in n, and so \nu\gtrsim\log\log n. Therefore, we have \sqrt{\nu p\log\log(np)}\geq\frac{1}{\kappa^{\prime}}\log\log(np) for some universal positive constant \kappa^{\prime}. Then by Chebyshev’s inequality

Pf{χνp2(nf22)νp+Kη(pνloglog(np)+loglog(np))}\displaystyle P_{f^{*}}\left\{\chi^{2}_{\nu p}(n||f^{*}||_{2}^{2})\leq\nu p+K_{\eta}\left(\sqrt{p\nu\log\log(np)}+\log\log(np)\right)\right\}
\displaystyle\leq P_{f^{*}}\left\{\chi^{2}_{\nu p}(n||f^{*}||_{2}^{2})\leq\nu p+K_{\eta}(1+\kappa^{\prime})\sqrt{p\nu\log\log(np)}\right\}
=Pf{(nf22Kη(1+κ)pνloglog(np))νp+nf22χνp2(nf22)}\displaystyle=P_{f^{*}}\left\{(n||f^{*}||_{2}^{2}-K_{\eta}(1+\kappa^{\prime})\sqrt{p\nu\log\log(np)})\leq\nu p+n||f^{*}||_{2}^{2}-\chi^{2}_{\nu p}(n||f^{*}||_{2}^{2})\right\}
Var(χνp2(nf22))(nf22Kη(1+κ)pνloglog(np))2\displaystyle\leq\frac{\operatorname{Var}\left(\chi^{2}_{\nu p}(n||f^{*}||_{2}^{2})\right)}{\left(n||f^{*}||_{2}^{2}-K_{\eta}(1+\kappa^{\prime})\sqrt{p\nu\log\log(np)}\right)^{2}}
2νp(nf22Kη(1+κ)pνloglog(np))2+4nf22(nf22Kη(1+κ)pνloglog(np))2.\displaystyle\leq\frac{2\nu p}{\left(n||f^{*}||_{2}^{2}-K_{\eta}(1+\kappa^{\prime})\sqrt{p\nu\log\log(np)}\right)^{2}}+\frac{4n||f^{*}||_{2}^{2}}{\left(n||f^{*}||_{2}^{2}-K_{\eta}(1+\kappa^{\prime})\sqrt{p\nu\log\log(np)}\right)^{2}}.

Consider that

Kη(1+κ)pνloglog(np)\displaystyle K_{\eta}(1+\kappa^{\prime})\sqrt{p\nu\log\log(np)} Kη(1+κ)2νploglog(np)\displaystyle\leq K_{\eta}(1+\kappa^{\prime})\sqrt{2\nu^{*}p\log\log(np)}
Kη(1+κ)2ns(nsploglog(np))4α4α+1\displaystyle\leq K_{\eta}(1+\kappa^{\prime})\sqrt{2}ns^{*}\left(\frac{ns^{*}}{\sqrt{p\log\log(np)}}\right)^{-\frac{4\alpha^{*}}{4\alpha^{*}+1}}
=2Kη(1+κ)nτdense2(p,s,n,α)\displaystyle=\sqrt{2}K_{\eta}(1+\kappa^{\prime})n\tau_{\text{dense}}^{2}(p,s^{*},n,\alpha^{*})
2Kη(1+κ)C2nf22.\displaystyle\leq\frac{\sqrt{2}K_{\eta}(1+\kappa^{\prime})}{C^{2}}n||f^{*}||_{2}^{2}.

Therefore, taking CηC_{\eta} sufficiently large, we have

Pf{maxν~𝒱φν~=0}\displaystyle P_{f^{*}}\left\{\max_{\tilde{\nu}\in\mathcal{V}}\varphi_{\tilde{\nu}}=0\right\}
2νp(nf22Kη(1+κ)pνloglog(np))2+4nf22(nf22Kη(1+κ)pνloglog(np))2\displaystyle\leq\frac{2\nu p}{\left(n||f^{*}||_{2}^{2}-K_{\eta}(1+\kappa^{\prime})\sqrt{p\nu\log\log(np)}\right)^{2}}+\frac{4n||f^{*}||_{2}^{2}}{\left(n||f^{*}||_{2}^{2}-K_{\eta}(1+\kappa^{\prime})\sqrt{p\nu\log\log(np)}\right)^{2}}
\displaystyle\leq\frac{2}{\left(\frac{C^{2}}{\sqrt{2}}-K_{\eta}(1+\kappa^{\prime})\right)^{2}\log\log(np)}+\frac{4}{\left(1-\frac{\sqrt{2}K_{\eta}(1+\kappa^{\prime})}{C^{2}}\right)^{2}n||f^{*}||_{2}^{2}}
\displaystyle\leq\frac{2}{\left(\frac{C^{2}}{\sqrt{2}}-K_{\eta}(1+\kappa^{\prime})\right)^{2}\log\log(np)}+\frac{4}{\left(1-\frac{\sqrt{2}K_{\eta}(1+\kappa^{\prime})}{C^{2}}\right)^{2}\frac{C^{2}}{\sqrt{2}}}
η2.\displaystyle\leq\frac{\eta}{2}.

Hence, the Type II error is bounded suitably. Since ff^{*} was arbitrary, we have shown that

P0{maxν𝒱φν=1}+maxsp1/2+δsupα[α0,α1]supf(s,α),f2Cτdense(p,s,n,α)Pf{maxν𝒱φν=0}η\displaystyle P_{0}\left\{\max_{\nu\in\mathcal{V}}\varphi_{\nu}=1\right\}+\max_{s\geq p^{1/2+\delta}}\sup_{\alpha\in[\alpha_{0},\alpha_{1}]}\sup_{\begin{subarray}{c}f\in\mathcal{F}(s,\alpha),\\ ||f||_{2}\geq C\tau_{\text{dense}}(p,s,n,\alpha)\end{subarray}}P_{f}\left\{\max_{\nu\in\mathcal{V}}\varphi_{\nu}=0\right\}\leq\eta

as desired. ∎
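To make the procedure concrete, here is a minimal numerical sketch of the adaptive test \max_{\nu\in\mathcal{V}}\varphi_{\nu} analyzed above. It is a simplification under stated assumptions: the data are assumed already reduced to sequence-space coefficients (the array `coeffs` below), the grid is a plain doubling grid rather than the exact \mathcal{V}_{\text{test}}, and the default `K_eta` is an arbitrary illustrative value rather than the \eta-dependent constant of the theorem.

```python
import numpy as np

def adaptive_dense_test(coeffs, n, K_eta=10.0):
    """Sketch of max_nu phi_nu: for each nu on a geometric grid, compare the
    chi-squared statistic of the first nu coefficients of each of the p
    components against nu*p + K_eta*(sqrt(nu*p*loglog(np)) + loglog(np))."""
    p, m = coeffs.shape
    ll = np.log(np.log(n * p))                  # the log log(np) calibration term
    nu = 1
    while nu <= m:                              # doubling grid {1, 2, 4, ...}
        stat = n * np.sum(coeffs[:, :nu] ** 2)  # ~ chi^2_{nu p} under the null
        if stat >= nu * p + K_eta * (np.sqrt(nu * p * ll) + ll):
            return 1                            # some phi_nu rejects
        nu *= 2
    return 0                                    # max_nu phi_nu = 0

# Under the null, each coefficient is pure noise of standard deviation 1/sqrt(n):
rng = np.random.default_rng(0)
n, p, m = 1000, 50, 64
print(adaptive_dense_test(rng.normal(size=(p, m)) / np.sqrt(n), n))  # 0 w.h.p.
```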

References

  • [1] Ery Arias-Castro, Emmanuel J. Candès and Yaniv Plan “Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism” In Ann. Statist. 39.5, 2011, pp. 2533–2556 DOI: 10.1214/11-AOS910
  • [2] Yannick Baraud “Non-asymptotic minimax rates of testing in signal detection” In Bernoulli 8.5, 2002, pp. 577–606
  • [3] Sohom Bhattacharya, Jianqing Fan and Debarghya Mukherjee “Deep neural networks for nonparametric interaction models with diverging dimension” In Ann. Statist., Forthcoming
  • [4] Lucien Birgé and Pascal Massart “Gaussian model selection” In J. Eur. Math. Soc. (JEMS) 3.3, 2001, pp. 203–268 DOI: 10.1007/s100970100031
  • [5] Lawrence D. Brown and Mark G. Low “Asymptotic equivalence of nonparametric regression and white noise” In Ann. Statist. 24.6, 1996, pp. 2384–2398 DOI: 10.1214/aos/1032181159
  • [6] M. V. Burnasev “Minimax detection of an imperfectly known signal against a background of Gaussian white noise” In Teor. Veroyatnost. i Primenen. 24.1, 1979, pp. 106–118
  • [7] Alexandra Carpentier and Nicolas Verzelen “Adaptive estimation of the sparsity in the Gaussian vector model” In Ann. Statist. 47.1, 2019, pp. 93–126 DOI: 10.1214/17-AOS1680
  • [8] Julien Chhor, Rajarshi Mukherjee and Subhabrata Sen “Sparse signal detection in heteroscedastic Gaussian sequence models: sharp minimax rates” In Bernoulli 30.3, 2024, pp. 2127–2153 DOI: 10.3150/23-bej1667
  • [9] O. Collier, L. Comminges and A. B. Tsybakov “On estimation of nonsmooth functionals of sparse normal means” In Bernoulli 26.3, 2020, pp. 1989–2020 DOI: 10.3150/19-BEJ1180
  • [10] Olivier Collier, Laëtitia Comminges and Alexandre B. Tsybakov “Minimax estimation of linear and quadratic functionals on sparsity classes” In Ann. Statist. 45.3, 2017, pp. 923–958 DOI: 10.1214/15-AOS1432
  • [11] Olivier Collier, Laëtitia Comminges, Alexandre B. Tsybakov and Nicolas Verzelen “Optimal adaptive estimation of linear functionals under sparsity” In Ann. Statist. 46.6A, 2018, pp. 3130–3150 DOI: 10.1214/17-AOS1653
  • [12] L. Comminges, O. Collier, M. Ndaoud and A. B. Tsybakov “Adaptive robust estimation in sparse vector model” In Ann. Statist. 49.3, 2021, pp. 1347–1377 DOI: 10.1214/20-aos2002
  • [13] Nabarun Deb, Rajarshi Mukherjee, Sumit Mukherjee and Ming Yuan “Detecting structured signals in Ising models” In Ann. Appl. Probab. 34.1A, 2024, pp. 1–45 DOI: 10.1214/23-aap1929
  • [14] David Donoho and Jiashun Jin “Higher criticism for detecting sparse heterogeneous mixtures” In Ann. Statist. 32.3, 2004, pp. 962–994 DOI: 10.1214/009053604000000265
  • [15] M. S. Ermakov “Minimax detection of a signal in Gaussian white noise” In Teor. Veroyatnost. i Primenen. 35.4, 1990, pp. 704–715 DOI: 10.1137/1135098
  • [16] Chao Gao, Fang Han and Cun-Hui Zhang “On estimation of isotonic piecewise constant signals” In Ann. Statist. 48.2, 2020, pp. 629–654 DOI: 10.1214/18-AOS1792
  • [17] Ghislaine Gayraud and Yuri Ingster “Detection of sparse additive functions” In Electron. J. Stat. 6, 2012, pp. 1409–1448 DOI: 10.1214/12-EJS715
  • [18] Peter Hall and Jiashun Jin “Innovated higher criticism for detecting sparse signals in correlated noise” In Ann. Statist. 38.3, 2010, pp. 1686–1732 DOI: 10.1214/09-AOS764
  • [19] Yanjun Han, Jiantao Jiao and Rajarshi Mukherjee “On estimation of Lr{L}_{r}-norms in Gaussian white noise models” In Probab. Theory Related Fields 177.3-4, 2020, pp. 1243–1294 DOI: 10.1007/s00440-020-00982-x
  • [20] Trevor Hastie and Robert Tibshirani “Generalized additive models” In Statist. Sci. 1.3, 1986, pp. 297–318
  • [21] Yu. I. Ingster “Adaptive tests for minimax testing of nonparametric hypotheses” In Probability theory and mathematical statistics, Vol. I (Vilnius, 1989) “Mokslas”, Vilnius, 1990, pp. 539–549
  • [22] Yu. I. Ingster “Asymptotically minimax testing of nonparametric hypotheses” In Probability theory and mathematical statistics, Vol. I (Vilnius, 1985) VNU Sci. Press, Utrecht, 1987, pp. 553–574
  • [23] Yu. I. Ingster “Minimax nonparametric detection of signals in white Gaussian noise” In Problemy Peredachi Informatsii 18.2, 1982, pp. 61–73
  • [24] Yu. I. Ingster and I. A. Suslina “Nonparametric Goodness-of-Fit Testing Under Gaussian Models” 169, Lecture Notes in Statistics Springer-Verlag, New York, 2003 DOI: 10.1007/978-0-387-21580-8
  • [25] Yuri Ingster and Oleg Lepski “Multichannel nonparametric signal detection” In Math. Methods Statist. 12.3, 2003, pp. 247–275
  • [26] Yuri I. Ingster, Alexandre B. Tsybakov and Nicolas Verzelen “Detection boundary in sparse regression” In Electron. J. Stat. 4, 2010, pp. 1476–1526 DOI: 10.1214/10-EJS589
  • [27] Iain M. Johnstone “Gaussian estimation: Sequence and wavelet models”
  • [28] Iain M. Johnstone “Chi-square oracle inequalities” In State of the art in probability and statistics (Leiden, 1999) 36, IMS Lecture Notes Monogr. Ser. Inst. Math. Statist., Beachwood, OH, 2001, pp. 399–418 DOI: 10.1214/lnms/1215090080
  • [29] Vladimir Koltchinskii and Ming Yuan “Sparsity in multiple kernel learning” In Ann. Statist. 38.6, 2010, pp. 3660–3695 DOI: 10.1214/10-AOS825
  • [30] Subhodh Kotekal and Chao Gao “Minimax rates for sparse signal detection under correlation” In Inf. Inference 12.4, 2023, Paper No. iaad044, 97 pp. DOI: 10.1093/imaiai/iaad044
  • [31] B. Laurent and P. Massart “Adaptive estimation of a quadratic functional by model selection” In Ann. Statist. 28.5, 2000, pp. 1302–1338 DOI: 10.1214/aos/1015957395
  • [32] O. Lepski, A. Nemirovski and V. Spokoiny “On estimation of the Lr{L}_{r} norm of a regression function” In Probab. Theory Related Fields 113.2, 1999, pp. 221–253 DOI: 10.1007/s004409970006
  • [33] Yi Lin and Hao Helen Zhang “Component selection and smoothing in multivariate nonparametric regression” In Ann. Statist. 34.5, 2006, pp. 2272–2297 DOI: 10.1214/009053606000000722
  • [34] Haoyang Liu, Chao Gao and Richard J. Samworth “Minimax rates in sparse, high-dimensional change point detection” In Ann. Statist. 49.2, 2021, pp. 1081–1112 DOI: 10.1214/20-aos1994
  • [35] Lukas Meier, Sara van de Geer and Peter Bühlmann “High-dimensional additive modeling” In Ann. Statist. 37.6B, 2009, pp. 3779–3821 DOI: 10.1214/09-AOS692
  • [36] Rajarshi Mukherjee and Gourab Ray “On testing for parameters in Ising models” In Ann. Inst. Henri Poincaré Probab. Stat. 58.1, 2022, pp. 164–187 DOI: 10.1214/21-aihp1157
  • [37] Rajarshi Mukherjee and Subhabrata Sen “On minimax exponents of sparse testing” arXiv:2003.00570 [math, stat] arXiv, 2020 DOI: 10.48550/arXiv.2003.00570
  • [38] Garvesh Raskutti, Martin J. Wainwright and Bin Yu “Minimax-optimal rates for sparse additive models over kernel classes via convex programming” In J. Mach. Learn. Res. 13, 2012, pp. 389–427
  • [39] Pradeep Ravikumar, John Lafferty, Han Liu and Larry Wasserman “Sparse additive models” In J. R. Stat. Soc. Ser. B Stat. Methodol. 71.5, 2009, pp. 1009–1030 DOI: 10.1111/j.1467-9868.2009.00718.x
  • [40] Markus Reiß “Asymptotic equivalence for nonparametric regression with multivariate and random design” In Ann. Statist. 36.4, 2008, pp. 1957–1982 DOI: 10.1214/07-AOS525
  • [41] V.. Spokoiny “Adaptive hypothesis testing using wavelets” In Ann. Statist. 24.6, 1996, pp. 2477–2498 DOI: 10.1214/aos/1032181163
  • [42] N. M. Temme “The asymptotic expansion of the incomplete gamma functions” In SIAM J. Math. Anal. 10.4, 1979, pp. 757–766 DOI: 10.1137/0510071
  • [43] N. M. Temme “Uniform asymptotic expansions of the incomplete gamma functions and the incomplete beta function” In Math. Comp. 29.132, 1975, pp. 1109–1114 DOI: 10.2307/2005750
  • [44] A. B. Tsybakov “Pointwise and sup-norm sharp adaptive estimation of functions on the Sobolev classes” In Ann. Statist. 26.6, 1998, pp. 2420–2469 DOI: 10.1214/aos/1024691478
  • [45] Alexandre B. Tsybakov “Introduction to Nonparametric Estimation”, Springer Series in Statistics Springer, New York, 2009 DOI: 10.1007/b13794
  • [46] Roman Vershynin “High-Dimensional Probability” 47, Cambridge Series in Statistical and Probabilistic Mathematics Cambridge University Press, Cambridge, 2018 DOI: 10.1017/9781108231596
  • [47] Martin J. Wainwright “High-Dimensional Statistics” 48, Cambridge Series in Statistical and Probabilistic Mathematics Cambridge University Press, Cambridge, 2019 DOI: 10.1017/9781108627771
  • [48] Yuting Wei and Martin J. Wainwright “The local geometry of testing in ellipses: tight control via localized Kolmogorov widths” In IEEE Trans. Inform. Theory 66.8, 2020, pp. 5110–5129 DOI: 10.1109/TIT.2020.2981313
  • [49] Yun Yang and Surya T. Tokdar “Minimax-optimal nonparametric regression in high dimensions” In Ann. Statist. 43.2, 2015, pp. 652–674 DOI: 10.1214/14-AOS1289
  • [50] Ming Yuan and Ding-Xuan Zhou “Minimax optimal rates of estimation in high dimensional additive models” In Ann. Statist. 44.6, 2016, pp. 2564–2593 DOI: 10.1214/15-AOS1422
  • [51] Anru R. Zhang and Yuchen Zhou “On the non-asymptotic and sharp lower tail bounds of random variables” In Stat 9, 2020, e314, 11 pp. DOI: 10.1002/sta4.314

Appendix A Hard thresholding

In this section, we collect results about the random variable (Z2αt(d))𝟙{Z2d+t2}(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}} for ZN(θ,Id)Z\sim N(\theta,I_{d}) and αt(d)\alpha_{t}(d) defined in (13). We consider the tail t2dt^{2}\gtrsim d and the bulk t2dt^{2}\lesssim d separately.

The proof outlines are similar to those employed in [34]. However, the authors of [34] only consider the case d=1, whereas we need to handle general d; consequently, a much more careful analysis is required.
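Throughout, \alpha_{t}(d) is the conditional mean E(\chi^{2}_{d}\,|\,\chi^{2}_{d}\geq d+t^{2}) (this is what makes the \theta=0 cases of Lemmas 9 and 13 exact), and by Lemma 18(i) it reduces to a ratio of two survival functions. The following is a minimal numerical sketch, assuming scipy is available (`chi2.sf` is the survival function 1-F_{d}; for very large d+t^{2} one would instead work on the log scale, e.g. with `chi2.logsf`, to avoid underflow):

```python
import numpy as np
from scipy.stats import chi2

def alpha_t(d, t):
    """alpha_t(d) = E[chi2_d | chi2_d >= d + t^2]. By Lemma 18(i), the
    truncated first moment int_{r^2}^inf z f_d(z) dz equals
    d * P(chi2_{d+2} >= r^2), so the conditional mean is the ratio below."""
    r2 = d + t ** 2
    return d * chi2.sf(r2, d + 2) / chi2.sf(r2, d)

def thresholded(z, d, t):
    """(||z||^2 - alpha_t(d)) * 1{||z||^2 >= d + t^2}: mean zero for z ~ N(0, I_d)."""
    s = float(np.sum(z ** 2))
    return s - alpha_t(d, t) if s >= d + t ** 2 else 0.0
```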

A.1 Tail

Lemma 8 (Moment generating function).

Suppose c~\widetilde{c} is a positive universal constant. There exist universal positive constants D,C~,C,D,\widetilde{C},C, and cc such that if dD,t2c~d,d\geq D,t^{2}\geq\widetilde{c}d, and 0<λ<12C~0<\lambda<\frac{1}{2\widetilde{C}}, then we have

E(eλY)exp(Ct4λ2ect2)E\left(e^{\lambda Y}\right)\leq\exp\left(Ct^{4}\lambda^{2}e^{-ct^{2}}\right)

where Y=(Z2αt(d))𝟙{Z2d+t2}Y=\left(||Z||^{2}-\alpha_{t}(d)\right)\mathbbm{1}_{\left\{||Z||^{2}\geq d+t^{2}\right\}}, ZN(0,Id)Z\sim N(0,I_{d}), and where αt(d)\alpha_{t}(d) is given by (13).

Proof.

We follow the broad approach of the argument presented in the proof of Lemma 18 in [34]. The universal positive constant DD will be chosen later on in the proof, so for now let dDd\geq D and ZN(0,Id)Z\sim N(0,I_{d}). Note we will also take DD large enough so that Lemma 22 is applicable. Since E(Y)=0E(Y)=0, we have

E(eλY)=E(1+λY+(eλY1λY))=1+E(eλY1λY).E(e^{\lambda Y})=E(1+\lambda Y+(e^{\lambda Y}-1-\lambda Y))=1+E(e^{\lambda Y}-1-\lambda Y).

Consider that for any yy\in\mathbb{R} we have

ey1y{(y)y2if y<0,y2if 0y1,eyif y>1.e^{y}-1-y\leq\begin{cases}(-y)\wedge y^{2}&\textit{if }y<0,\\ y^{2}&\textit{if }0\leq y\leq 1,\\ e^{y}&\textit{if }y>1.\end{cases}

Therefore,

E(eλY1λY)λ2E(Y2𝟙{Y<0})+λ2E(Y2𝟙{0Y1/λ})+E(eλY𝟙{Y>1/λ}).E(e^{\lambda Y}-1-\lambda Y)\leq\lambda^{2}E(Y^{2}\mathbbm{1}_{\{Y<0\}})+\lambda^{2}E(Y^{2}\mathbbm{1}_{\{0\leq Y\leq 1/\lambda\}})+E(e^{\lambda Y}\mathbbm{1}_{\{Y>1/\lambda\}}). (49)

Each term will be bounded separately. Considering the first term, note that Lemma 22 asserts αt(d)d+C(t2d)\alpha_{t}(d)\leq d+C^{*}(t^{2}\vee d) for a universal constant CC^{*}. Note also that αt(d)d+t2\alpha_{t}(d)\geq d+t^{2}. Therefore,

λ2E(Y2𝟙{Y<0})\displaystyle\lambda^{2}E(Y^{2}\mathbbm{1}_{\{Y<0\}}) =λ2E((Z2αt(d))2𝟙{αt(d)>Z2d+t2})\displaystyle=\lambda^{2}E((||Z||^{2}-\alpha_{t}(d))^{2}\mathbbm{1}_{\{\alpha_{t}(d)>||Z||^{2}\geq d+t^{2}\}})
λ2(αt(d)dt2)2P{αt(d)>Z2d+t2}\displaystyle\leq\lambda^{2}(\alpha_{t}(d)-d-t^{2})^{2}P\left\{\alpha_{t}(d)>||Z||^{2}\geq d+t^{2}\right\}
λ2(d+C(t2d)dt2)2P{Z2d+t2}\displaystyle\leq\lambda^{2}(d+C^{*}(t^{2}\vee d)-d-t^{2})^{2}P\left\{||Z||^{2}\geq d+t^{2}\right\}
2Cλ2(d2t4)exp(cmin(t4d,t2))\displaystyle\leq 2C^{*}\lambda^{2}(d^{2}\vee t^{4})\exp\left(-c\min\left(\frac{t^{4}}{d},t^{2}\right)\right)
=2Cλ2t4exp(ct2)\displaystyle=2C^{*}\lambda^{2}t^{4}\exp\left(-ct^{2}\right)

where cc is a universal positive constant and CC^{*} remains a universal constant but whose value may change from line to line. Note we have also used Corollary 1 in the above display as well as t2c~dt^{2}\geq\widetilde{c}d. To summarize, we have shown

λ2E(Y2𝟙{Y<0})2Cλ2t4exp(ct2).\lambda^{2}E(Y^{2}\mathbbm{1}_{\{Y<0\}})\leq 2C^{*}\lambda^{2}t^{4}\exp\left(-ct^{2}\right). (50)

We now bound the second term in (49). Let fdf_{d} denote the probability density function of the χd2\chi^{2}_{d} distribution. We have

λ2E(Y2𝟙{0<Y<1/λ})\displaystyle\lambda^{2}E(Y^{2}\mathbbm{1}_{\{0<Y<1/\lambda\}}) λ2E(Y2𝟙{0<Y})\displaystyle\leq\lambda^{2}E(Y^{2}\mathbbm{1}_{\{0<Y\}})
=λ2E((Z2αt(d))2𝟙{Z2>αt(d)})\displaystyle=\lambda^{2}E((||Z||^{2}-\alpha_{t}(d))^{2}\mathbbm{1}_{\{||Z||^{2}>\alpha_{t}(d)\}})
=λ2αt(d)(zαt(d))2fd(z)𝑑z\displaystyle=\lambda^{2}\int_{\alpha_{t}(d)}^{\infty}(z-\alpha_{t}(d))^{2}f_{d}(z)\,dz
2λ2αt(d)(z2+αt(d)2)fd(z)𝑑z.\displaystyle\leq 2\lambda^{2}\int_{\alpha_{t}(d)}^{\infty}\left(z^{2}+\alpha_{t}(d)^{2}\right)f_{d}(z)\,dz.

An application of Lemma 18 yields

αt(d)z2fd(z)𝑑z\displaystyle\int_{\alpha_{t}(d)}^{\infty}z^{2}f_{d}(z)\,dz =d(d+2)P{χd+42αt(d)}\displaystyle=d(d+2)P\left\{\chi^{2}_{d+4}\geq\alpha_{t}(d)\right\}
3d2P{χd+42d+4+(αt(d)d4)}\displaystyle\leq 3d^{2}P\left\{\chi^{2}_{d+4}\geq d+4+(\alpha_{t}(d)-d-4)\right\}
3d2P{χd+42d+4+(t24)}\displaystyle\leq 3d^{2}P\left\{\chi^{2}_{d+4}\geq d+4+(t^{2}-4)\right\}
6d2exp(cmin((t24)2d+4,(t24)))\displaystyle\leq 6d^{2}\exp\left(-c\min\left(\frac{(t^{2}-4)^{2}}{d+4},(t^{2}-4)\right)\right)
6d2exp(cmin(t420d,t22))\displaystyle\leq 6d^{2}\exp\left(-c\min\left(\frac{t^{4}}{20d},\frac{t^{2}}{2}\right)\right)
6d2exp(cmin(t4d,t2))\displaystyle\leq 6d^{2}\exp\left(-c\min\left(\frac{t^{4}}{d},t^{2}\right)\right)
=6d2exp(ct2)\displaystyle=6d^{2}\exp\left(-ct^{2}\right)

where cc remains a universal constant but has a value which may change from line to line. Note we have used αt(d)d+t2\alpha_{t}(d)\geq d+t^{2}, Corollary 1, and t2c~dt^{2}\geq\widetilde{c}d. Likewise, consider that

αt(d)αt(d)2fd(z)𝑑z\displaystyle\int_{\alpha_{t}(d)}^{\infty}\alpha_{t}(d)^{2}f_{d}(z)\,dz =αt(d)2P{χd2αt(d)}\displaystyle=\alpha_{t}(d)^{2}P\{\chi^{2}_{d}\geq\alpha_{t}(d)\}
αt(d)2P{χd2d+t2}\displaystyle\leq\alpha_{t}(d)^{2}P\{\chi^{2}_{d}\geq d+t^{2}\}
2(d+Ct2)2exp(cmin(t4d,t2))\displaystyle\leq 2(d+C^{*}t^{2})^{2}\exp\left(-c\min\left(\frac{t^{4}}{d},t^{2}\right)\right)
Ct4exp(cmin(t4d,t2))\displaystyle\leq C^{*}t^{4}\exp\left(-c\min\left(\frac{t^{4}}{d},t^{2}\right)\right)
=Ct4exp(ct2)\displaystyle=C^{*}t^{4}\exp\left(-ct^{2}\right)

where we have used (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}, t2c~dt^{2}\geq\widetilde{c}d, Lemma 22, and Corollary 1. Again, CC^{*} and cc remain universal constants but have values which may change from line to line. Thus, we have shown

\lambda^{2}E(Y^{2}\mathbbm{1}_{\{0\leq Y\leq 1/\lambda\}})\leq C^{*}\lambda^{2}t^{4}\exp\left(-ct^{2}\right). (51)

We now bound the final term in (49). Consider that for 0λ<120\leq\lambda<\frac{1}{2} we have

E(eλY𝟙{Y>1/λ})=αt(d)+1/λeλ(zαt(d))fd(z)𝑑z=eλαt(d)αt(d)+1/λ12d/2Γ(d/2)e(12λ)zzd21𝑑z.E(e^{\lambda Y}\mathbbm{1}_{\{Y>1/\lambda\}})=\int_{\alpha_{t}(d)+1/\lambda}^{\infty}e^{\lambda(z-\alpha_{t}(d))}f_{d}(z)\,dz=e^{-\lambda\alpha_{t}(d)}\int_{\alpha_{t}(d)+1/\lambda}^{\infty}\frac{1}{2^{d/2}\Gamma(d/2)}e^{-\left(\frac{1}{2}-\lambda\right)z}z^{\frac{d}{2}-1}\,dz.

Let u=(12λ)zu=\left(\frac{1}{2}-\lambda\right)z. Then du=(12λ)dzdu=\left(\frac{1}{2}-\lambda\right)dz and so

eλαt(d)αt(d)+1/λ12d/2Γ(d/2)e(12λ)zzd21𝑑z\displaystyle e^{-\lambda\alpha_{t}(d)}\int_{\alpha_{t}(d)+1/\lambda}^{\infty}\frac{1}{2^{d/2}\Gamma(d/2)}e^{-\left(\frac{1}{2}-\lambda\right)z}z^{\frac{d}{2}-1}\,dz =eλαt(d)2d/2Γ(d/2)(αt(d)+1/λ)(1/2λ)euud21(12λ)d2𝑑u\displaystyle=\frac{e^{-\lambda\alpha_{t}(d)}}{2^{d/2}\Gamma(d/2)}\int_{\left(\alpha_{t}(d)+1/\lambda\right)(1/2-\lambda)}^{\infty}e^{-u}u^{\frac{d}{2}-1}\left(\frac{1}{2}-\lambda\right)^{-\frac{d}{2}}\,du
=eλαt(d)(12λ)d/21Γ(d2)(αt(d)+1/λ)(1/2λ)euud21𝑑u\displaystyle=\frac{e^{-\lambda\alpha_{t}(d)}}{(1-2\lambda)^{d/2}}\cdot\frac{1}{\Gamma\left(\frac{d}{2}\right)}\int_{(\alpha_{t}(d)+1/\lambda)(1/2-\lambda)}^{\infty}e^{-u}u^{\frac{d}{2}-1}\,du
=eλαt(d)(12λ)d/2Γ(d2,(1λ+αt(d))(12λ))Γ(d2)\displaystyle=\frac{e^{-\lambda\alpha_{t}(d)}}{(1-2\lambda)^{d/2}}\cdot\frac{\Gamma\left(\frac{d}{2},\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)\right)}{\Gamma\left(\frac{d}{2}\right)}

where Γ(s,x)=xetts1𝑑t\Gamma(s,x)=\int_{x}^{\infty}e^{-t}t^{s-1}\,dt. To summarize, we have shown

E(eλY𝟙{Y>1/λ})=eλαt(d)(12λ)d/2Γ(d2,(1λ+αt(d))(12λ))Γ(d2).E(e^{\lambda Y}\mathbbm{1}_{\{Y>1/\lambda\}})=\frac{e^{-\lambda\alpha_{t}(d)}}{(1-2\lambda)^{d/2}}\cdot\frac{\Gamma\left(\frac{d}{2},\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)\right)}{\Gamma\left(\frac{d}{2}\right)}. (52)

To continue, we would like to apply Corollary 2, but we must first verify the conditions. Let a=d2a=\frac{d}{2} and η=2(μlog(1+μ))\eta=\sqrt{2(\mu-\log(1+\mu))} with μ=(1λ+αt(d))(12λ)a1\mu=\frac{\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)}{a}-1. Note we can take DD to be a sufficiently large universal constant in order to ensure aa is sufficiently large. Since c~dt2\widetilde{c}d\leq t^{2}, it follows from Lemma 22 that αt(d)C~t2\alpha_{t}(d)\leq\widetilde{C}t^{2} for some positive universal constant C~\widetilde{C}. Without loss of generality, we can take C~32\widetilde{C}\geq\frac{3}{2}. Let us restrict our attention to λ<12C~\lambda<\frac{1}{2\widetilde{C}}. Observe that

μ=1λ+αt(d)22λαt(d)dd(2C~2)+t222C~(C~t2)d2(C~1)d=Cd\mu=\frac{\frac{1}{\lambda}+\alpha_{t}(d)-2-2\lambda\alpha_{t}(d)-d}{d}\geq\frac{(2\widetilde{C}-2)+t^{2}-\frac{2}{2\widetilde{C}}(\widetilde{C}t^{2})}{d}\geq\frac{2(\widetilde{C}-1)}{d}=\frac{C^{**}}{d}

where we have defined C=2(C~1)C^{**}=2(\widetilde{C}-1). Since we have shown μ>0\mu>0, we can apply Corollary 2. By Corollary 2 we have

Γ(d2,(1λ+αt(d))(12λ))Γ(d2)(1Φ(ηa))+Cdexp(aη22).\frac{\Gamma\left(\frac{d}{2},\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)\right)}{\Gamma\left(\frac{d}{2}\right)}\leq\left(1-\Phi\left(\eta\sqrt{a}\right)\right)+\frac{C^{*}}{\sqrt{d}}\exp\left(-\frac{a\eta^{2}}{2}\right).

Here, Φ\Phi denotes the cumulative distribution function of the standard normal distribution and CC^{*} is a universal constant. By the Gaussian tail bound 1Φ(x)φ(x)x1-\Phi(x)\leq\frac{\varphi(x)}{x} for x>0x>0 where φ=Φ\varphi=\Phi^{\prime}, we have

Γ(d2,(1λ+αt(d))(12λ))Γ(d2)Cexp(aη22)(1ηa+1d)\frac{\Gamma\left(\frac{d}{2},\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)\right)}{\Gamma\left(\frac{d}{2}\right)}\leq C^{*}\exp\left(-\frac{a\eta^{2}}{2}\right)\left(\frac{1}{\eta\sqrt{a}}+\frac{1}{\sqrt{d}}\right)

where the value of C^{*} has changed from line to line but remains a universal constant. To continue the calculation, we need to bound \eta\sqrt{a} from below. Note we can take D\geq C^{**}, so that \frac{C^{**}}{d}\leq 1. For \lambda<\frac{1}{2\widetilde{C}}, it follows that

ηa\displaystyle\eta\sqrt{a} =a2(μlog(1+μ))\displaystyle=\sqrt{a}\sqrt{2(\mu-\log(1+\mu))}
a2(Cdlog(1+Cd))\displaystyle\geq\sqrt{a}\sqrt{2\left(\frac{C^{**}}{d}-\log\left(1+\frac{C^{**}}{d}\right)\right)}
\displaystyle\geq\sqrt{a}\sqrt{2\cdot\left(\frac{(C^{**})^{2}}{2d^{2}}-\frac{(C^{**})^{3}}{3d^{3}}\right)}
=(C)22d(C)33d2\displaystyle=\sqrt{\frac{(C^{**})^{2}}{2d}-\frac{(C^{**})^{3}}{3d^{2}}}
=Cd12C3d\displaystyle=\frac{C^{**}}{\sqrt{d}}\sqrt{\frac{1}{2}-\frac{C^{**}}{3d}}
C6d.\displaystyle\geq\frac{C^{**}}{\sqrt{6d}}.

Consequently, we have the bound

Γ(d2,(1λ+αt(d))(12λ))Γ(d2)Cdexp(aη22)\frac{\Gamma\left(\frac{d}{2},\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)\right)}{\Gamma\left(\frac{d}{2}\right)}\leq C^{*}\sqrt{d}\exp\left(-\frac{a\eta^{2}}{2}\right)

where the value of CC^{*} has changed but remains a universal constant. We now examine the term eaη2/2e^{-a\eta^{2}/2}. Consider that

exp(aη22)\displaystyle\exp\left(-\frac{a\eta^{2}}{2}\right) =exp(a(μlog(1+μ)))\displaystyle=\exp\left(-a\left(\mu-\log\left(1+\mu\right)\right)\right)
=exp(aμ)((1λ+αt(d))(12λ)d)d/2\displaystyle=\exp\left(-a\mu\right)\left(\frac{\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(1-2\lambda\right)}{d}\right)^{d/2}
=exp(d212λ+1αt(d)2+λαt(d))((1λ+αt(d))(12λ)d)d/2.\displaystyle=\exp\left(\frac{d}{2}-\frac{1}{2\lambda}+1-\frac{\alpha_{t}(d)}{2}+\lambda\alpha_{t}(d)\right)\left(\frac{\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(1-2\lambda\right)}{d}\right)^{d/2}.

Therefore, letting the value of CC^{*} change from line to line but remaining a universal constant, we have from (52)

E(eλY𝟙{Y>1/λ})\displaystyle E\left(e^{\lambda Y}\mathbbm{1}_{\{Y>1/\lambda\}}\right) Cdeλαt(d)(12λ)d/2exp(d212λ+1αt(d)2+λαt(d))((1λ+αt(d))(12λ)d)d/2\displaystyle\leq C^{*}\sqrt{d}\frac{e^{-\lambda\alpha_{t}(d)}}{(1-2\lambda)^{d/2}}\exp\left(\frac{d}{2}-\frac{1}{2\lambda}+1-\frac{\alpha_{t}(d)}{2}+\lambda\alpha_{t}(d)\right)\left(\frac{\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(1-2\lambda\right)}{d}\right)^{d/2}
Cde(12λ+αt(d)d2)(1+1λ+αt(d)dd)d/2\displaystyle\leq C^{*}\sqrt{d}e^{-\left(\frac{1}{2\lambda}+\frac{\alpha_{t}(d)-d}{2}\right)}\left(1+\frac{\frac{1}{\lambda}+\alpha_{t}(d)-d}{d}\right)^{d/2}
=Cdexp((12λ+αt(d)d2)+d2log(1+1λ+αt(d)dd))\displaystyle=C^{*}\sqrt{d}\exp\left(-\left(\frac{1}{2\lambda}+\frac{\alpha_{t}(d)-d}{2}\right)+\frac{d}{2}\log\left(1+\frac{\frac{1}{\lambda}+\alpha_{t}(d)-d}{d}\right)\right)
=Cdexp(d2((1λ+αt(d)dd)log(1+1λ+αt(d)dd)))\displaystyle=C^{*}\sqrt{d}\exp\left(-\frac{d}{2}\left(\left(\frac{\frac{1}{\lambda}+\alpha_{t}(d)-d}{d}\right)-\log\left(1+\frac{\frac{1}{\lambda}+\alpha_{t}(d)-d}{d}\right)\right)\right)
Cdexp(d21λ+αt(d)dcd)\displaystyle\leq C^{*}\sqrt{d}\exp\left(-\frac{d}{2}\cdot\frac{\frac{1}{\lambda}+\alpha_{t}(d)-d}{c^{**}d}\right)
Cdexp(12cλt22c)\displaystyle\leq C^{*}\sqrt{d}\exp\left(-\frac{1}{2c^{**}\lambda}-\frac{t^{2}}{2c^{**}}\right)

for a universal positive constant c^{**}. We have used that u-\log(1+u)\gtrsim u for u\geq 1. Note that we can use this since \frac{\frac{1}{\lambda}+\alpha_{t}(d)-d}{d}\geq\frac{t^{2}}{d}\gtrsim 1. Since e^{-\frac{1}{2c^{**}u}}\lesssim u^{2} for all u>0, it follows that

E(eλY𝟙{Y>1/λ})Cdλ2exp(t22c)E\left(e^{\lambda Y}\mathbbm{1}_{\{Y>1/\lambda\}}\right)\leq C^{*}\sqrt{d}\lambda^{2}\exp\left(-\frac{t^{2}}{2c^{**}}\right) (53)

where the value of CC^{*}, again, has changed but remains a universal constant. Putting together our bounds (50), (51), (53) into (49) yields

E(eλY)1+Ct4λ2ect2exp(Ct4λ2ect2)E(e^{\lambda Y})\leq 1+Ct^{4}\lambda^{2}e^{-ct^{2}}\leq\exp\left(Ct^{4}\lambda^{2}e^{-ct^{2}}\right)

for λ<12C~\lambda<\frac{1}{2\widetilde{C}} where C,c>0C,c>0 are universal constants. The proof is complete. ∎

Lemma 9.

Let ZN(θ,Id)Z\sim N(\theta,I_{d}). Suppose c~\widetilde{c} is a universal positive constant. There exists a universal constant CC such that for every t2c~dt^{2}\geq\widetilde{c}d, we have

E{(Z2αt(d))𝟙{Z2d+t2}}{=0if θ=0,0if θ2<Ct2,θ2/2if θ2Ct2.E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right\}\begin{cases}=0&\textit{if }\theta=0,\\ \geq 0&\textit{if }||\theta||^{2}<Ct^{2},\\ \geq||\theta||^{2}/2&\textit{if }||\theta||^{2}\geq Ct^{2}.\end{cases}

Here, αt(d)\alpha_{t}(d) is given by (13).

Proof.

We will make a choice for CC at the end of the proof. Consider first the case where θ=0\theta=0. Then E{(Z2αt(d))𝟙{Z2d+t2}}=0E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right\}=0 by definition of αt(d)\alpha_{t}(d) and so we have the desired result. The second case follows since the expression inside the expectation is stochastically increasing in θ||\theta||. Moving on to the final case, suppose θ2Ct2||\theta||^{2}\geq Ct^{2}. Since αt(d)d+t2\alpha_{t}(d)\geq d+t^{2}, we have

E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}<d+t^{2}\}}\right\}\leq 0.

Consequently,

E{(Z2αt(d))𝟙{Z2d+t2}}\displaystyle E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right\} =E{(Z2αt(d))(1𝟙{Z2<d+t2})}\displaystyle=E\left\{(||Z||^{2}-\alpha_{t}(d))(1-\mathbbm{1}_{\{||Z||^{2}<d+t^{2}\}})\right\}
=d+θ2αt(d)E{(Z2αt(d))𝟙{Z2<d+t2}}\displaystyle=d+||\theta||^{2}-\alpha_{t}(d)-E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}<d+t^{2}\}}\right\}
d+θ2αt(d)\displaystyle\geq d+||\theta||^{2}-\alpha_{t}(d)

By Lemma 22 and t2dt^{2}\gtrsim d, there exists a universal positive constant CC^{*} such that αt(d)d+Ct2\alpha_{t}(d)\leq d+C^{*}t^{2}. Therefore,

E{(Z2αt(d))𝟙{Z2d+t2}}θ2Ct2θ2(1CC).\displaystyle E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right\}\geq||\theta||^{2}-C^{*}t^{2}\geq||\theta||^{2}\left(1-\frac{C^{*}}{C}\right).

Selecting C=2C^{*} yields the desired result. ∎
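The three regimes in Lemma 9 are easy to probe by Monte Carlo; the following sketch (with arbitrary illustrative values of d, t, and the simulation size, and with `alpha_t` as in the earlier sketch) estimates the left-hand side and compares it to ||\theta||^{2}/2:

```python
import numpy as np
from scipy.stats import chi2

def alpha_t(d, t):                            # as in the earlier sketch
    r2 = d + t ** 2
    return d * chi2.sf(r2, d + 2) / chi2.sf(r2, d)

rng = np.random.default_rng(0)
d, t = 10, 6.0                                # tail regime: t^2 = 36 >= d
a = alpha_t(d, t)
for theta_sq in (0.0, t ** 2, 20 * t ** 2):   # roughly the three regimes
    z = rng.normal(size=(200_000, d))
    z[:, 0] += np.sqrt(theta_sq)              # mean theta = sqrt(theta_sq) * e_1
    s = (z ** 2).sum(axis=1)
    y = np.where(s >= d + t ** 2, s - a, 0.0)
    print(f"||theta||^2 = {theta_sq:6.1f}:  E(Y) ~ {y.mean():8.2f},"
          f"  ||theta||^2/2 = {theta_sq / 2:6.1f}")
```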

Lemma 10.

Let ZN(θ,Id)Z\sim N(\theta,I_{d}). If t2c~dt^{2}\geq\widetilde{c}d for a universal positive constant c~\widetilde{c}, then

Var((Z2αt(d))𝟙{Z2d+t2}){t4exp(cmin(t4d,t2))if θ=0,t4if 0<θ<2t,θ2if θ>2t.\operatorname{Var}\left((||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right)\lesssim\begin{cases}t^{4}\exp\left(-c^{*}\min\left(\frac{t^{4}}{d},t^{2}\right)\right)&\textit{if }\theta=0,\\ t^{4}&\textit{if }0<||\theta||<2t,\\ ||\theta||^{2}&\textit{if }||\theta||>2t.\end{cases}

Here, αt(d)\alpha_{t}(d) is given by (13) and cc^{*} is a universal positive constant.

Proof.

The proof of Lemma 14 can be repeated with the modification of invoking Lemma 22 instead of Lemma 21. ∎

Lemma 11.

Suppose c~\widetilde{c} is a universal positive constant. There exist universal positive constants DD and CC^{*} such that if dDd\geq D, t2c~dt^{2}\geq\widetilde{c}d, and x>0x>0, then

P\left\{\sum_{i=1}^{n}\left(||Z_{i}||^{2}-\alpha_{t}(d)\right)\mathbbm{1}_{\{||Z_{i}||^{2}\geq d+t^{2}\}}\geq C^{*}\left(\sqrt{xnt^{4}e^{-ct^{2}}}+x\right)\right\}\leq e^{-x}

where Z1,,ZniidN(0,Id)Z_{1},...,Z_{n}\overset{iid}{\sim}N(0,I_{d}) and αt(d)\alpha_{t}(d) is given by (13). Here, cc is a universal positive constant.

Proof.

Let DD be the constant from Lemma 8. We use the Chernoff method to obtain the desired bound. Let YY be as in Lemma 8 and let C~\widetilde{C} be the universal constant from Lemma 8. For any u>0u>0, we have by Lemma 8

P{j=1n(Zj2αt(d))𝟙{Zj2d+t2}>u}\displaystyle P\left\{\sum_{j=1}^{n}\left(||Z_{j}||^{2}-\alpha_{t}(d)\right)\mathbbm{1}_{\{||Z_{j}||^{2}\geq d+t^{2}\}}>u\right\} infλ<12C~eλu(E(eλY))n\displaystyle\leq\inf_{\lambda<\frac{1}{2\widetilde{C}}}e^{-\lambda u}\left(E(e^{\lambda Y})\right)^{n}
infλ<12C~exp(λu+Cnt4λ2ect2)\displaystyle\leq\inf_{\lambda<\frac{1}{2\widetilde{C}}}\exp\left(-\lambda u+Cnt^{4}\lambda^{2}e^{-ct^{2}}\right)
\displaystyle\leq\exp\left(-C\left(\frac{u^{2}}{nt^{4}e^{-ct^{2}}}\wedge u\right)\right)

where c>0c>0 is a universal constant and C>0C>0 is a universal constant whose value may change from line to line but remain a universal constant. Selecting u=12(xnt4ect2C+xC)u=\frac{1}{2}\left(\sqrt{\frac{xnt^{4}e^{-ct^{2}}}{C}}+\frac{x}{C}\right) and selecting CC^{*} suitably yields the desired result. The proof is complete. ∎
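For completeness, the elementary optimization behind the penultimate display of the preceding proof is the standard subexponential Chernoff computation (the factor \frac{1}{4} below is one admissible choice of constant): writing A:=Cnt^{4}e^{-ct^{2}},

\inf_{0<\lambda<\frac{1}{2\widetilde{C}}}\exp\left(-\lambda u+A\lambda^{2}\right)\leq\exp\left(-\frac{1}{4}\min\left(\frac{u^{2}}{A},\frac{u}{\widetilde{C}}\right)\right),

which is obtained by taking \lambda=\frac{u}{2A} when \frac{u}{2A}<\frac{1}{2\widetilde{C}} and \lambda=\frac{1}{2\widetilde{C}} otherwise; in the latter case A\leq u\widetilde{C}, so that A\lambda^{2}\leq\frac{u}{4\widetilde{C}}.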

A.2 Bulk

Lemma 12.

Let LL^{*} be the universal positive constant from Lemma 21. There exist universal constants C,C,C,c>0C^{*},C^{**},C,c>0 such that if dCd\geq C^{**} and 1βLd1\leq\beta\leq L^{*}\sqrt{d}, then

E(eλY)exp(Cdβ2λ2ecβ2)E(e^{\lambda Y})\leq\exp\left(Cd\beta^{2}\lambda^{2}e^{-c\beta^{2}}\right)

for λ<β2(βC+d)\lambda<\frac{\beta}{2(\beta C^{*}+\sqrt{d})}. Here, Y=(Z2αt(d))𝟙{Z2d+t2}Y=(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}} with t2=βdt^{2}=\beta\sqrt{d} and ZN(0,Id)Z\sim N(0,I_{d}). Also, αt(d)\alpha_{t}(d) is given by (13).

Proof.

The argument we give will be similar to the proof of Lemma 8, namely we will separately bound each term in (49) and substitute into the equation

E(eλY)=1+E(eλY1λY).E(e^{\lambda Y})=1+E\left(e^{\lambda Y}-1-\lambda Y\right).

We start with the first term on the right-hand side in (49). By Lemma 21, we have \alpha_{t}(d)\leq d+C^{*}\beta\sqrt{d} for a universal positive constant C^{*}. Note that C^{*}\geq 1 since we trivially have \alpha_{t}(d)\geq d+t^{2}=d+\beta\sqrt{d}. Consequently, arguing as in the proof of Lemma 8, we have

\displaystyle\lambda^{2}E(Y^{2}\mathbbm{1}_{\{Y<0\}})\leq\lambda^{2}(d+t^{2}-\alpha_{t}(d))^{2}P\{||Z||^{2}\geq d+t^{2}\}
Cλ2dβ2exp(cmin(t4d,t2))\displaystyle\leq C\lambda^{2}d\beta^{2}\exp\left(-c\min\left(\frac{t^{4}}{d},t^{2}\right)\right)
=Cλ2dβ2ecβ2\displaystyle=C\lambda^{2}d\beta^{2}e^{-c\beta^{2}}

where C,c>0C,c>0 are universal constants. We have used Corollary 1 to obtain the final inequality. We now bound the second term in (49). Letting fdf_{d} denote the probability density function of the χd2\chi^{2}_{d} distribution, we can repeat and then continue the calculation in the proof of Lemma 8 to obtain

λ2E(Y2𝟙{0<Y<1/λ})\displaystyle\lambda^{2}E(Y^{2}\mathbbm{1}_{\{0<Y<1/\lambda\}}) λ2αt(d)(zαt(d))2fd(z)𝑑z\displaystyle\leq\lambda^{2}\int_{\alpha_{t}(d)}^{\infty}(z-\alpha_{t}(d))^{2}f_{d}(z)\,dz
λ2d+t2(zαt(d))2fd(z)𝑑z\displaystyle\leq\lambda^{2}\int_{d+t^{2}}^{\infty}(z-\alpha_{t}(d))^{2}f_{d}(z)\,dz
\displaystyle=\lambda^{2}\left(\int_{d+t^{2}}^{\infty}z^{2}f_{d}(z)\,dz-2\alpha_{t}(d)\int_{d+t^{2}}^{\infty}zf_{d}(z)\,dz+\alpha_{t}(d)^{2}\int_{d+t^{2}}^{\infty}f_{d}(z)\,dz\right)
=λ2P{χd2d+t2}Var(Z2Z2d+t2)\displaystyle=\lambda^{2}P\{\chi^{2}_{d}\geq d+t^{2}\}\operatorname{Var}\left(||Z||^{2}\,|\,||Z||^{2}\geq d+t^{2}\right)
Cλ2dβ2exp(cmin(t4d,t2))\displaystyle\leq C\lambda^{2}d\beta^{2}\exp\left(-c\min\left(\frac{t^{4}}{d},t^{2}\right)\right)
=Cλ2dβ2ecβ2\displaystyle=C\lambda^{2}d\beta^{2}e^{-c\beta^{2}}

where the values of C,c>0C,c>0 have changed but remain universal constants. We have used Lemma 24 here.

We now bound the final term in (49). Arguing exactly as in the proof of Lemma 8, we have

E(eλY𝟙{Y>1/λ})=eλαt(d)(12λ)d/2Γ(d2,(1λ+αt(d))(12λ))Γ(d2)E(e^{\lambda Y}\mathbbm{1}_{\{Y>1/\lambda\}})=\frac{e^{-\lambda\alpha_{t}(d)}}{(1-2\lambda)^{d/2}}\cdot\frac{\Gamma\left(\frac{d}{2},\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)\right)}{\Gamma\left(\frac{d}{2}\right)}

for 0λ<120\leq\lambda<\frac{1}{2}. Recall that Γ(s,x)=xetts1𝑑t\Gamma(s,x)=\int_{x}^{\infty}e^{-t}t^{s-1}\,dt.

Let us restrict our attention to λ<β2(d+Cβ)\lambda<\frac{\beta}{2(\sqrt{d}+C^{*}\beta)}. We seek to apply Corollary 2, but we must verify the conditions. Let a=d2a=\frac{d}{2} and η=2(μlog(1+μ))\eta=\sqrt{2(\mu-\log\left(1+\mu\right))} with μ=(1λ+αt(d))(12λ)a1\mu=\frac{\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)}{a}-1. Observe

μ=(1λ+αt(d))(12λ)dd=12λλ+αt(d)(12λ)dd.\mu=\frac{\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(1-2\lambda\right)-d}{d}=\frac{\frac{1-2\lambda}{\lambda}+\alpha_{t}(d)(1-2\lambda)-d}{d}.

Consider that λ<β2(d+Cβ)=t22(d+Cβd)αt(d)d2αt(d)\lambda<\frac{\beta}{2(\sqrt{d}+C^{*}\beta)}=\frac{t^{2}}{2(d+C^{*}\beta\sqrt{d})}\leq\frac{\alpha_{t}(d)-d}{2\alpha_{t}(d)}. Therefore, αt(d)(12λ)d0\alpha_{t}(d)(1-2\lambda)-d\geq 0 and so

μ12λλd2βd.\mu\geq\frac{1-2\lambda}{\lambda d}\geq\frac{2}{\beta\sqrt{d}}.

Note we have used C1C^{*}\geq 1. Since μ>0\mu>0 and dCd\geq C^{**} which is a sufficiently large universal constant, we can apply Corollary 2. By Corollary 2, we have

Γ(d2,(1λ+αt(d))(12λ))Γ(d2)(1Φ(ηa))+Cdexp(aη22).\frac{\Gamma\left(\frac{d}{2},\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)\right)}{\Gamma\left(\frac{d}{2}\right)}\leq\left(1-\Phi\left(\eta\sqrt{a}\right)\right)+\frac{C^{{\dagger}}}{\sqrt{d}}\exp\left(-\frac{a\eta^{2}}{2}\right).

Here, CC^{{\dagger}} is a positive universal constant. Recall that Φ\Phi denotes the cumulative distribution function of the standard normal distribution. By the Gaussian tail bound 1Φ(x)φ(x)x1-\Phi(x)\leq\frac{\varphi(x)}{x} for x>0x>0, we have

Γ(d2,(1λ+αt(d))(12λ))Γ(d2)Cexp(aη22)(1ηa+1d)\frac{\Gamma\left(\frac{d}{2},\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)\right)}{\Gamma\left(\frac{d}{2}\right)}\leq C^{{\dagger}}\exp\left(-\frac{a\eta^{2}}{2}\right)\left(\frac{1}{\eta\sqrt{a}}+\frac{1}{\sqrt{d}}\right) (54)

where the value of CC^{{\dagger}} has changed but remains a universal constant. To continue with the bound, we need to bound ηa\eta\sqrt{a} from below. Let us take CC^{**} larger than 44. Since a=d2a=\frac{d}{2}, we have

ηa\displaystyle\eta\sqrt{a} =a2(μlog(1+μ))\displaystyle=\sqrt{a}\sqrt{2(\mu-\log(1+\mu))}
\displaystyle\geq\sqrt{a}\sqrt{2\left(\frac{2}{\beta\sqrt{d}}-\log\left(1+\frac{2}{\beta\sqrt{d}}\right)\right)}
a2(42β2d83β3d3/2)\displaystyle\geq\sqrt{a}\sqrt{2\left(\frac{4}{2\beta^{2}d}-\frac{8}{3\beta^{3}d^{3/2}}\right)}
cβ\displaystyle\geq\frac{c^{*}}{\beta}

where c>0c^{*}>0 is a universal constant. We can conclude from (54) that

Γ(d2,(1λ+αt(d))(12λ))Γ(d2)Cβexp(aη22)\frac{\Gamma\left(\frac{d}{2},\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(\frac{1}{2}-\lambda\right)\right)}{\Gamma\left(\frac{d}{2}\right)}\leq C^{{\dagger}}\beta\exp\left(-\frac{a\eta^{2}}{2}\right)

where the value of CC^{\dagger} has changed but remains a universal constant. We now examine the term eaη22e^{-\frac{a\eta^{2}}{2}}. Arguing exactly as in the proof of Lemma 8, we obtain

exp(aη22)=exp(d212λ+1αt(d)2+λαt(d))((1λ+αt(d))(12λ)d)d/2\exp\left(-\frac{a\eta^{2}}{2}\right)=\exp\left(\frac{d}{2}-\frac{1}{2\lambda}+1-\frac{\alpha_{t}(d)}{2}+\lambda\alpha_{t}(d)\right)\left(\frac{\left(\frac{1}{\lambda}+\alpha_{t}(d)\right)\left(1-2\lambda\right)}{d}\right)^{d/2}

which, as argued in the proof of Lemma 8, leads us to the bound

E(e^{\lambda Y}\mathbbm{1}_{\{Y>1/\lambda\}})\leq C\lambda^{2}d\beta^{2}e^{-c\beta^{2}}

for \lambda<\frac{\beta}{2(\sqrt{d}+C^{*}\beta)}. Combining this with the first two bounds and substituting into (49) yields E(e^{\lambda Y})\leq 1+C\lambda^{2}d\beta^{2}e^{-c\beta^{2}}\leq\exp\left(Cd\beta^{2}\lambda^{2}e^{-c\beta^{2}}\right), as desired. ∎

Lemma 13.

Let ZN(θ,Id)Z\sim N(\theta,I_{d}). Suppose 1βLd1\leq\beta\leq L^{*}\sqrt{d} where LL^{*} is the universal constant from Lemma 21. There exists a universal positive constant CC such that if t2=βdt^{2}=\beta\sqrt{d}, then we have

E{(Z2αt(d))𝟙{Z2d+t2}}{=0if θ=0,0if θ2<Ct2,θ2/2if θ2Ct2E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right\}\begin{cases}=0&\textit{if }\theta=0,\\ \geq 0&\textit{if }||\theta||^{2}<Ct^{2},\\ \geq||\theta||^{2}/2&\textit{if }||\theta||^{2}\geq Ct^{2}\end{cases}

where αt(d)\alpha_{t}(d) is given by (13).

Proof.

We will make a choice for CC later on in the proof. The proof for the cases θ=0\theta=0 and θ2Ct2||\theta||^{2}\leq Ct^{2} follows exactly as in the proof of Lemma 9. Here, we focus on the final case in which θ2Ct2||\theta||^{2}\geq Ct^{2}. Since αt(d)d+t2\alpha_{t}(d)\geq d+t^{2}, we have

E{(Z2αt(d))𝟙{Z2<d+t2}}0.E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}<d+t^{2}\}}\right\}\leq 0.

Since t2=βdt^{2}=\beta\sqrt{d}, we can apply Lemma 21 to obtain

E{(Z2αt(d))𝟙{Z2d+t2}}\displaystyle E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right\} =E{(Z2αt(d))(1𝟙{Z2<d+t2})}\displaystyle=E\left\{(||Z||^{2}-\alpha_{t}(d))(1-\mathbbm{1}_{\{||Z||^{2}<d+t^{2}\}})\right\}
d+θ2αt(d)E{(Z2αt(d))𝟙{Z2<d+t2}}\displaystyle\geq d+||\theta||^{2}-\alpha_{t}(d)-E\left\{(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}<d+t^{2}\}}\right\}
θ2Cβd\displaystyle\geq||\theta||^{2}-C^{*}\beta\sqrt{d}
=θ2(1Cβdθ2)\displaystyle=||\theta||^{2}\left(1-\frac{C^{*}\beta\sqrt{d}}{||\theta||^{2}}\right)
=θ2(1Ct2θ2)\displaystyle=||\theta||^{2}\left(1-\frac{C^{*}t^{2}}{||\theta||^{2}}\right)
θ2(1CC).\displaystyle\geq||\theta||^{2}\left(1-\frac{C^{*}}{C}\right).

where C^{*} is a universal positive constant. Taking C=2C^{*} completes the proof. ∎

Lemma 14.

Let ZN(θ,Id)Z\sim N(\theta,I_{d}). Suppose 1βLd1\leq\beta\leq L^{*}\sqrt{d} where LL^{*} is the universal constant from Lemma 21. Then there exists a universal positive constant cc^{*} such that if t2=βdt^{2}=\beta\sqrt{d}, then

Var((Z2αt(d))𝟙{Z2d+t2}){t4exp(ct4d)if θ=0,d+θ2if θ2t,t4if 0<θ<2t.\operatorname{Var}\left((||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right)\lesssim\begin{cases}t^{4}\exp\left(-\frac{c^{*}t^{4}}{d}\right)&\textit{if }\theta=0,\\ d+||\theta||^{2}&\textit{if }||\theta||\geq 2t,\\ t^{4}&\textit{if }0<||\theta||<2t.\end{cases}

Here, αt(d)\alpha_{t}(d) is given by (13).

Proof.

Let fdf_{d} and FdF_{d} respectively denote the probability density function and cumulative distribution function of the χd2\chi^{2}_{d} distribution.

Case 1: Consider the first case in which θ=0\theta=0. Then

Var((Z2αt(d))𝟙{Z2d+t2})\displaystyle\operatorname{Var}((||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}) E((Z2αt(d))2𝟙{Z2d+t2})\displaystyle\leq E\left((||Z||^{2}-\alpha_{t}(d))^{2}\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right)
=P{||Z||2d+t2}E((||Z||2αt(d))2|||Z||2d+t2)\displaystyle=P\left\{||Z||^{2}\geq d+t^{2}\right\}\cdot E\left((||Z||^{2}-\alpha_{t}(d))^{2}\,|\,||Z||^{2}\geq d+t^{2}\right)
=(1Fd(d+t2))Var(Z2Z2d+t2)\displaystyle=\left(1-F_{d}(d+t^{2})\right)\operatorname{Var}\left(||Z||^{2}\,|\,||Z||^{2}\geq d+t^{2}\right)
C(1Fd(d+t2))dβ2\displaystyle\leq C^{*}\left(1-F_{d}(d+t^{2})\right)d\beta^{2}

where we have applied the definition of \alpha_{t}(d) and Lemma 24; here, C^{*} is a universal positive constant. An application of Corollary 1 then gives the desired result for this case.

We now move to the other two cases. Suppose θ0\theta\neq 0. For ease of notation, let Y=(Z2αt(d))𝟙{Z2d+t2}Y=(||Z||^{2}-\alpha_{t}(d))\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}. Observe

Var(Y)\displaystyle\operatorname{Var}(Y)
=E(Var(Y| 1{Z2d+t2}))+Var(E(Y| 1{Z2d+t2}))\displaystyle=E\left(\operatorname{Var}(Y\,|\,\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}})\right)+\operatorname{Var}\left(E(Y\,|\,\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}})\right)
\displaystyle\leq P\left\{||Z||^{2}\geq d+t^{2}\right\}\operatorname{Var}\left(||Z||^{2}\,|\,||Z||^{2}\geq d+t^{2}\right)+P\left\{||Z||^{2}<d+t^{2}\right\}\left(E(||Z||^{2}-\alpha_{t}(d)\,|\,||Z||^{2}\geq d+t^{2})\right)^{2}. (55)

Here we have used that E(Y\,|\,\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}) equals E(||Z||^{2}-\alpha_{t}(d)\,|\,||Z||^{2}\geq d+t^{2}) with probability P\{||Z||^{2}\geq d+t^{2}\} and vanishes otherwise, so its variance is P\{||Z||^{2}\geq d+t^{2}\}P\{||Z||^{2}<d+t^{2}\} times the square of that conditional expectation. Note that 0\leq E(||Z||^{2}-\alpha_{t}(d)\,|\,||Z||^{2}\geq d+t^{2}) since \theta\neq 0 (by an appeal to stochastic ordering), and so we can find upper bounds for the square of this conditional expectation by first finding upper bounds on the conditional expectation. We examine each term separately in the above display. First, consider

P{Z2d+t2}Var(Z2Z2d+t2)\displaystyle P\left\{||Z||^{2}\geq d+t^{2}\right\}\operatorname{Var}\left(||Z||^{2}\,|\,||Z||^{2}\geq d+t^{2}\right) E(Var(Z2| 1{Z2d+t2}))\displaystyle\leq E\left(\operatorname{Var}\left(||Z||^{2}\,|\,\mathbbm{1}_{\{||Z||^{2}\geq d+t^{2}\}}\right)\right)
Var(Z2)\displaystyle\leq\operatorname{Var}(||Z||^{2})
=2d+4θ2.\displaystyle=2d+4||\theta||^{2}. (56)

We now examine the second term. Note we can write Z=g+θZ=g+\theta where gN(0,Id)g\sim N(0,I_{d}). Therefore,

E(||Z||2αt(d)|||Z||2d+t2)\displaystyle E(||Z||^{2}-\alpha_{t}(d)\,|\,||Z||^{2}\geq d+t^{2}) =E(||g+θ||2αt(d)|||g+θ||2d+t2)\displaystyle=E\left(||g+\theta||^{2}-\alpha_{t}(d)\,|\,||g+\theta||^{2}\geq d+t^{2}\right)
E(||g||2+2θ,g+||θ||2dt2|||g+θ||2d+t2)\displaystyle\leq E(||g||^{2}+2\langle\theta,g\rangle+||\theta||^{2}-d-t^{2}\,|\,||g+\theta||^{2}\geq d+t^{2})
=E(||g||2d|||g+θ||2d+t2)+||θ||2t2\displaystyle=E(||g||^{2}-d\,|\,||g+\theta||^{2}\geq d+t^{2})+||\theta||^{2}-t^{2} (57)

where we have used that θ,g𝟙{g+θ2d+t2}=𝑑θ,g𝟙{g+θ2d+t2}\langle\theta,g\rangle\mathbbm{1}_{\{||g+\theta||^{2}\geq d+t^{2}\}}\overset{d}{=}\langle\theta,-g\rangle\mathbbm{1}_{\{||-g+\theta||^{2}\geq d+t^{2}\}}. With this in hand, we have E(θ,g𝟙{g+θ2d+t2})=0E(\langle\theta,g\rangle\mathbbm{1}_{\{||g+\theta||^{2}\geq d+t^{2}\}})=0, which further implies E(θ,g|||g+θ||2d+t2)=0E(\langle\theta,g\rangle\,|\,||g+\theta||^{2}\geq d+t^{2})=0. Note we have also used αt(d)d+t2\alpha_{t}(d)\geq d+t^{2} to obtain the second line in the above display. With the above display in hand, we now examine the remaining two cases.

Case 2: Consider the case θ2t||\theta||\geq 2t. Observe that

E(||g||2d|||g+θ||2d+t2)=E((g2d)𝟙{g+θ2d+t2})P{g+θ2d+t2}.E(||g||^{2}-d\,|\,||g+\theta||^{2}\geq d+t^{2})=\frac{E\left((||g||^{2}-d)\mathbbm{1}_{\{||g+\theta||^{2}\geq d+t^{2}\}}\right)}{P\left\{||g+\theta||^{2}\geq d+t^{2}\right\}}.

Examining the denominator, since g+θ2χd2(θ2)=𝑑χd12+χ12(θ2)||g+\theta||^{2}\sim\chi^{2}_{d}(||\theta||^{2})\overset{d}{=}\chi^{2}_{d-1}+\chi^{2}_{1}(||\theta||^{2}) where the two χ2\chi^{2}-variates on the right hand side are independent, we have

P{g+θ2d+t2}P{χd12d1}P{χ12(θ2)1+t2}cP{χ12(θ2)1+t2}P\{||g+\theta||^{2}\geq d+t^{2}\}\geq P\{\chi^{2}_{d-1}\geq d-1\}P\{\chi^{2}_{1}(||\theta||^{2})\geq 1+t^{2}\}\geq cP\{\chi^{2}_{1}(||\theta||^{2})\geq 1+t^{2}\}

where cc is a universal positive constant. Examining the numerator, consider that by Lemma 18 we have

E((g2d)𝟙{g+θ2d+t2})\displaystyle E\left((||g||^{2}-d)\mathbbm{1}_{\{||g+\theta||^{2}\geq d+t^{2}\}}\right) E((g2d)𝟙{g2d})\displaystyle\leq E((||g||^{2}-d)\mathbbm{1}_{\{||g||^{2}\geq d\}})
=d(zd)fd(z)𝑑z\displaystyle=\int_{d}^{\infty}(z-d)f_{d}(z)\,dz
=dd(fd+2(z)fd(z))𝑑z\displaystyle=d\int_{d}^{\infty}(f_{d+2}(z)-f_{d}(z))\,dz
=2ddfd+2(z)𝑑z\displaystyle=-2d\int_{d}^{\infty}f_{d+2}^{\prime}(z)\,dz
=2dfd+2(d).\displaystyle=2df_{d+2}(d).

Hence, from (57), we have shown

E(||Z||2αt(d)|||Z||2d+t2)2dfd+2(d)cP{χ12(θ2)1+t2}+||θ||2t2.E(||Z||^{2}-\alpha_{t}(d)\,|\,||Z||^{2}\geq d+t^{2})\leq\frac{2df_{d+2}(d)}{cP\{\chi^{2}_{1}(||\theta||^{2})\geq 1+t^{2}\}}+||\theta||^{2}-t^{2}.

Thus, we have the bound

\displaystyle\operatorname{Var}\left(Y\right)\leq 2d+4||\theta||^{2}+P\{||Z||^{2}<d+t^{2}\}\left(\frac{2df_{d+2}(d)}{cP\{\chi^{2}_{1}(||\theta||^{2})\geq 1+t^{2}\}}+||\theta||^{2}-t^{2}\right)^{2}.

Consider that

2dfd+2(d)c=dcΓ(d2+1)(d2e)d/2cd\frac{2df_{d+2}(d)}{c}=\frac{d}{c\Gamma\left(\frac{d}{2}+1\right)}\left(\frac{d}{2e}\right)^{d/2}\leq c\sqrt{d}

where the value of cc can change in each expression but remains a universal positive constant. The final inequality follows from Stirling’s formula. Moreover, since θ2t||\theta||\geq 2t, there exists a positive universal constant cc^{\prime} such that P{χ12(θ2)1+t2}1/cP\{\chi^{2}_{1}(||\theta||^{2})\geq 1+t^{2}\}\geq 1/c^{\prime}. Furthermore, it follows by Chebyshev’s inequality that

P{Z2<d+t2}=P{θ2t2<d+θ2Z2}2d+4θ2(θ2t2)2.P\{||Z||^{2}<d+t^{2}\}=P\{||\theta||^{2}-t^{2}<d+||\theta||^{2}-||Z||^{2}\}\leq\frac{2d+4||\theta||^{2}}{(||\theta||^{2}-t^{2})^{2}}.

Consequently,

Var(Y)\displaystyle\operatorname{Var}(Y) 2d+4θ2+2d+4θ2(θ2t2)2(ccd+θ2t2)2\displaystyle\leq 2d+4||\theta||^{2}+\frac{2d+4||\theta||^{2}}{(||\theta||^{2}-t^{2})^{2}}(cc^{\prime}\sqrt{d}+||\theta||^{2}-t^{2})^{2}
d+θ2+d+θ2(θ2t2)2(d+θ2)2\displaystyle\asymp d+||\theta||^{2}+\frac{d+||\theta||^{2}}{(||\theta||^{2}-t^{2})^{2}}(\sqrt{d}+||\theta||^{2})^{2}
d+θ2+d+θ2θ4θ4\displaystyle\asymp d+||\theta||^{2}+\frac{d+||\theta||^{2}}{||\theta||^{4}}||\theta||^{4}
d+θ2.\displaystyle\asymp d+||\theta||^{2}.

The proof for this case is complete.

Case 3: Consider the case 0<θ<2t0<||\theta||<2t. Then

E(||Z||2αt(d)|||Z||2d+t2)\displaystyle E(||Z||^{2}-\alpha_{t}(d)\,|\,||Z||^{2}\geq d+t^{2}) =E(||g||2d|||g+θ||2d+t2)+||θ||2t2\displaystyle=E(||g||^{2}-d\,|\,||g+\theta||^{2}\geq d+t^{2})+||\theta||^{2}-t^{2}
E(||g||2d|||g+θ||2d+t2)+3t2\displaystyle\leq E(||g||^{2}-d\,|\,||g+\theta||^{2}\geq d+t^{2})+3t^{2}
E(||g||2d|||g||2d+t2)+3t2\displaystyle\leq E(||g||^{2}-d\,|\,||g||^{2}\geq d+t^{2})+3t^{2}
=αt(d)d+3t2\displaystyle=\alpha_{t}(d)-d+3t^{2}
βd+t2\displaystyle\lesssim\beta\sqrt{d}+t^{2}
t2\displaystyle\asymp t^{2}

where we have used Lemma 21. Therefore, using the above bound with (56) and plugging into (55), we obtain

Var(Y)d+θ2+t4t4.\operatorname{Var}(Y)\lesssim d+||\theta||^{2}+t^{4}\asymp t^{4}.

The proof is complete. ∎
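As with the mean, the variance regimes can be probed by simulation. A rough sketch in the bulk scaling t^{2}=\beta\sqrt{d} (parameter values are arbitrary illustrative choices, and the comparisons are only up to the unspecified universal constants):

```python
import numpy as np
from scipy.stats import chi2

def alpha_t(d, t):                        # as in the earlier sketches
    r2 = d + t ** 2
    return d * chi2.sf(r2, d + 2) / chi2.sf(r2, d)

rng = np.random.default_rng(1)
d, beta = 100, 2.0
t2 = beta * np.sqrt(d)                    # bulk scaling t^2 = beta * sqrt(d)
t = np.sqrt(t2)
a = alpha_t(d, t)
for th in (0.0, t, 5 * t):                # the three regimes of Lemma 14
    z = rng.normal(size=(50_000, d))
    z[:, 0] += th
    s = (z ** 2).sum(axis=1)
    y = np.where(s >= d + t2, s - a, 0.0)
    print(f"||theta|| = {th:5.1f}: Var(Y) ~ {y.var():8.1f}   "
          f"[t^4 = {t2 ** 2:.0f}, d + ||theta||^2 = {d + th ** 2:.0f}]")
```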

Lemma 15.

Let L^{*} be the universal positive constant from Lemma 21. There exist universal positive constants D and C^{*} such that if d\geq D, then for any 1\leq\beta\leq L^{*}\sqrt{d} and x>0, we have

P{j=1n(Zj2αt(d))𝟙{Zj2d+t2}C(xnt4ect4d+dt2x)}exP\left\{\sum_{j=1}^{n}\left(||Z_{j}||^{2}-\alpha_{t}(d)\right)\mathbbm{1}_{\{||Z_{j}||^{2}\geq d+t^{2}\}}\geq C^{*}\left(\sqrt{xnt^{4}e^{-\frac{ct^{4}}{d}}}+\frac{d}{t^{2}}x\right)\right\}\leq e^{-x}

where Z1,,ZniidN(0,Id)Z_{1},...,Z_{n}\overset{iid}{\sim}N(0,I_{d}), t2=βdt^{2}=\beta\sqrt{d}, and αt(d)\alpha_{t}(d) is given by (13). Here, c>0c>0 is a universal constant.

Proof.

The proof of Lemma 11 can be repeated with the modification of invoking Lemma 12 instead of Lemma 8 and taking infimum over λ<β2(βC+d)\lambda<\frac{\beta}{2\left(\beta C^{*}+\sqrt{d}\right)}. ∎

Appendix B Properties of the χ2d\chi^{2}_{d} distribution

Theorem 7 (Bernstein’s inequality - Theorem 2.8.1 [46]).

Let Y1,,YdY_{1},...,Y_{d} be independent mean-zero subexponential random variables. Then, for every u0u\geq 0, we have

P\left\{\left|\sum_{i=1}^{d}Y_{i}\right|\geq u\right\}\leq 2\exp\left(-c\min\left(\frac{u^{2}}{\sum_{i=1}^{d}||Y_{i}||^{2}_{\psi_{1}}},\frac{u}{\max_{i}||Y_{i}||_{\psi_{1}}}\right)\right)

where c>0 is a universal constant and \psi_{2},\psi_{1} denote the subgaussian and subexponential norms respectively (see (2.13) and (2.21) of [46]).

Corollary 1.

Suppose ZN(0,Id)Z\sim N(0,I_{d}). If u0u\geq 0, then

P{||Z||2d+u}2exp(cmin(u2d,u))P\left\{||Z||^{2}\geq d+u\right\}\leq 2\exp\left(-c\min\left(\frac{u^{2}}{d},u\right)\right)

where c>0c>0 is a universal constant.

Lemma 16 (Lemma 1 [31]).

Let Z1,,ZdiidN(0,1)Z_{1},...,Z_{d}\overset{iid}{\sim}N(0,1). If λ1λd>0\lambda_{1}\geq...\geq\lambda_{d}>0 and x>0x>0, then

P\left\{\sum_{j=1}^{d}\lambda_{j}Z_{j}^{2}\geq\sum_{j=1}^{d}\lambda_{j}+2\sqrt{x\sum_{j=1}^{d}\lambda_{j}^{2}}+2\lambda_{1}x\right\}\leq e^{-x}.

Lemma 17 (Corollary 3 [51]).

Suppose ZN(0,Id)Z\sim N(0,I_{d}). There exist universal constants C,c>0C,c>0 such that

P{||Z||2d+u}Cexp(cmin(u2d,u))P\left\{||Z||^{2}\geq d+u\right\}\geq C\exp\left(-c\min\left(\frac{u^{2}}{d},u\right)\right)

for all u>0u>0.

Lemma 18 ([28]).

Let f_{d} and F_{d} respectively denote the probability density and cumulative distribution functions of the \chi^{2}_{d} distribution. Then the following relations hold:

  (i) tf_{d}(t)=df_{d+2}(t),

  (ii) t^{2}f_{d}(t)=d(d+2)f_{d+4}(t),

  (iii) f_{d+2}^{\prime}(t)=\frac{f_{d}(t)-f_{d+2}(t)}{2},

  (iv) (1-F_{d+2}(t))-(1-F_{d}(t))=\frac{e^{-t/2}t^{d/2}}{2^{d/2}\Gamma(d/2+1)}.
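All four relations are classical and can be sanity-checked numerically with scipy (`chi2.pdf`/`chi2.sf` are the \chi^{2}_{d} density and survival function; the derivative in (iii) is approximated by a central finite difference; `gammaln` is \log\Gamma; the test point is arbitrary):

```python
import numpy as np
from scipy.stats import chi2
from scipy.special import gammaln

d, t, h = 7, 12.3, 1e-6                                      # test point and step size
f = lambda k, x: chi2.pdf(x, k)
print(t * f(d, t), d * f(d + 2, t))                          # (i)
print(t ** 2 * f(d, t), d * (d + 2) * f(d + 4, t))           # (ii)
print((f(d + 2, t + h) - f(d + 2, t - h)) / (2 * h),         # (iii), numerically
      (f(d, t) - f(d + 2, t)) / 2)
print(chi2.sf(t, d + 2) - chi2.sf(t, d),                     # (iv)
      np.exp(-t / 2 + (d / 2) * np.log(t / 2) - gammaln(d / 2 + 1)))
```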

Lemma 19.

Let fdf_{d} and FdF_{d} respectively denote the probability density and cumulative distribution functions of the χ2d\chi^{2}_{d} distribution. Suppose d>2d>2. If tdt\geq d, then 2fd(t)1Fd(t)2f_{d}(t)\leq 1-F_{d}(t).

Proof.

By the mean value theorem, we have for any r>t,

infxdx(1Fd(x))2fd(x)\displaystyle\inf_{x\geq d}\frac{\frac{\partial}{\partial x}(1-F_{d}(x))}{2f_{d}^{\prime}(x)} (1Fd(t))(1Fd(r))2fd(t)2fd(r)\displaystyle\leq\frac{(1-F_{d}(t))-(1-F_{d}(r))}{2f_{d}(t)-2f_{d}(r)}
=1Fd(t)2fd(t)1Fd(r)2fd(t)1fd(r)fd(t).\displaystyle=\frac{\frac{1-F_{d}(t)}{2f_{d}(t)}-\frac{1-F_{d}(r)}{2f_{d}(t)}}{1-\frac{f_{d}(r)}{f_{d}(t)}}.

Consider that limr1Fd(r)=limrfd(r)=0\lim_{r\to\infty}1-F_{d}(r)=\lim_{r\to\infty}f_{d}(r)=0. So taking rr\to\infty yields

infxdx(1Fd(x))2fd(x)1Fd(t)2fd(t).\inf_{x\geq d}\frac{\frac{\partial}{\partial x}(1-F_{d}(x))}{2f_{d}^{\prime}(x)}\leq\frac{1-F_{d}(t)}{2f_{d}(t)}.

We now evaluate the infimum on the left-hand side. For xdx\geq d, consider that an application of Lemma 18 gives

x(1Fd(x))2fd(x)\displaystyle\frac{\frac{\partial}{\partial x}(1-F_{d}(x))}{2f_{d}^{\prime}(x)} =fd(x)fd2(x)fd(x)\displaystyle=-\frac{f_{d}(x)}{f_{d-2}(x)-f_{d}(x)}
=1fd2(x)fd(x)1\displaystyle=-\frac{1}{\frac{f_{d-2}(x)}{f_{d}(x)}-1}
=1xfd2(x)xfd(x)1\displaystyle=-\frac{1}{\frac{xf_{d-2}(x)}{xf_{d}(x)}-1}
=1d2x1\displaystyle=-\frac{1}{\frac{d-2}{x}-1}
=11d2x.\displaystyle=\frac{1}{1-\frac{d-2}{x}}.

Since xd>d2x\geq d>d-2, it follows that

infxdx(1Fd(x))2fd(x)=1.\inf_{x\geq d}\frac{\frac{\partial}{\partial x}(1-F_{d}(x))}{2f_{d}^{\prime}(x)}=1.

Thus we can immediately conclude 2fd(t)1Fd(t)2f_{d}(t)\leq 1-F_{d}(t) as desired. ∎
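A quick numerical check of Lemma 19 at a handful of points (illustrative values; the lemma requires d>2 and t\geq d):

```python
import numpy as np
from scipy.stats import chi2

d = 5
for t in np.linspace(d, 10 * d, 7):
    assert 2 * chi2.pdf(t, d) <= chi2.sf(t, d)   # 2 f_d(t) <= 1 - F_d(t)
print("2 f_d(t) <= 1 - F_d(t) at all sampled points")
```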

Lemma 20.

Let FdF_{d} denote the cumulative distribution function of the χ2d\chi^{2}_{d} distribution. If x0x\geq 0, then

1Fd(x)=1Γ(d2)x/2td/21etdt=Q(d2,x2)1-F_{d}(x)=\frac{1}{\Gamma\left(\frac{d}{2}\right)}\int_{x/2}^{\infty}t^{d/2-1}e^{-t}\,dt=Q\left(\frac{d}{2},\frac{x}{2}\right)

where QQ is the upper incomplete gamma function defined in Theorem 8.

Proof.

The result follows directly from a change of variables when integrating the probability density function. ∎
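In scipy, the regularized upper incomplete gamma function Q of Theorem 8 is `scipy.special.gammaincc`, so Lemma 20 amounts to the following one-line numerical check (test point arbitrary):

```python
from scipy.special import gammaincc
from scipy.stats import chi2

d, x = 9, 17.5                                  # arbitrary test point
print(chi2.sf(x, d), gammaincc(d / 2, x / 2))   # both compute Q(d/2, x/2)
```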

Lemma 21.

Suppose dd is larger than a sufficiently large universal constant. There exist universal positive constants LL^{*} and CC^{*} such that the following holds. If 1βLd1\leq\beta\leq L^{*}\sqrt{d} and t2=βdt^{2}=\beta\sqrt{d}, then

αt(d)d+Cβd\alpha_{t}(d)\leq d+C^{*}\beta\sqrt{d}

where αt(d)\alpha_{t}(d) is given by (13).

Proof.

For ease of notation, let r2=d+t2r^{2}=d+t^{2}. Let fdf_{d} and FdF_{d} denote the probability density and cumulative distribution functions of the χ2d\chi^{2}_{d} distribution. We will select the universal constant LL^{*} later on in the proof. By Lemma 18, we have

αt(d)\displaystyle\alpha_{t}(d) =r2zfd(z)dz1Fd(r2)\displaystyle=\frac{\int_{r^{2}}^{\infty}zf_{d}(z)\,dz}{1-F_{d}(r^{2})}
=r2dfd+2(z)dz1Fd(r2)\displaystyle=\frac{\int_{r^{2}}^{\infty}df_{d+2}(z)\,dz}{1-F_{d}(r^{2})}
=d(1+(1Fd+2(r2))(1Fd(r2))1Fd(r2))\displaystyle=d\left(1+\frac{\left(1-F_{d+2}(r^{2})\right)-\left(1-F_{d}(r^{2})\right)}{1-F_{d}(r^{2})}\right)
=d(1+rder2/22d/2Γ(d2+1)11Fd(r2))\displaystyle=d\left(1+\frac{r^{d}e^{-r^{2}/2}}{2^{d/2}\Gamma\left(\frac{d}{2}+1\right)}\cdot\frac{1}{1-F_{d}(r^{2})}\right)
=d(1+(r22)d/2er2/2Γ(d2+1)11Fd(r2))\displaystyle=d\left(1+\left(\frac{r^{2}}{2}\right)^{d/2}\frac{e^{-r^{2}/2}}{\Gamma\left(\frac{d}{2}+1\right)}\cdot\frac{1}{1-F_{d}(r^{2})}\right)

Rearranging terms and invoking Stirling’s approximation (which states Γ(x+1)2πx(xe)x\Gamma(x+1)\sim\sqrt{2\pi x}\left(\frac{x}{e}\right)^{x} as xx\to\infty) yields

(r22)d/2er2/2Γ(d2+1)1+cπdexp(d2log(r22)r22d2log(d2)+d2)\left(\frac{r^{2}}{2}\right)^{d/2}\frac{e^{-r^{2}/2}}{\Gamma\left(\frac{d}{2}+1\right)}\leq\frac{1+c}{\sqrt{\pi d}}\exp\left(\frac{d}{2}\log\left(\frac{r^{2}}{2}\right)-\frac{r^{2}}{2}-\frac{d}{2}\log\left(\frac{d}{2}\right)+\frac{d}{2}\right)

for a universal constant c>0 since d is larger than a sufficiently large universal constant. Applying Lemma 20 and Corollary 2 (with a=\frac{d}{2}, x=\frac{r^{2}}{2}, and \mu,\eta as in Theorem 8) yields

1Fd(r2)=Q(d2,r22)(1Φ(ηa))eaη2/22πcd1-F_{d}(r^{2})=Q\left(\frac{d}{2},\frac{r^{2}}{2}\right)\geq\left(1-\Phi(\eta\sqrt{a})\right)-\frac{e^{-a\eta^{2}/2}}{\sqrt{2\pi}}\cdot\frac{c^{**}}{\sqrt{d}}

where cc^{**} is a universal constant. Observe that ηa\eta\sqrt{a} is larger than a sufficiently large universal constant since dd is larger than a sufficiently large universal constant. Using the fact that 1Φ(x)=1x2πex2/2(1+o(1))1-\Phi(x)=\frac{1}{x\sqrt{2\pi}}e^{-x^{2}/2}\left(1+o(1)\right) as xx\to\infty, we have

1Fd(r2)12πeaη22(cηacd)1-F_{d}(r^{2})\geq\frac{1}{\sqrt{2\pi}}e^{-\frac{a\eta^{2}}{2}}\left(\frac{c^{*}}{\eta\sqrt{a}}-\frac{c^{**}}{\sqrt{d}}\right)

for a universal positive constant cc^{*}. Consider that

η22=μlog(1+μ)=r22d21log(r22)+log(d2).\frac{\eta^{2}}{2}=\mu-\log(1+\mu)=\frac{\frac{r^{2}}{2}}{\frac{d}{2}}-1-\log\left(\frac{r^{2}}{2}\right)+\log\left(\frac{d}{2}\right).

Consequently,

-\frac{\eta^{2}a}{2}=-\left(\frac{r^{2}}{2}-\frac{d}{2}-\frac{d}{2}\log\left(\frac{r^{2}}{2}\right)+\frac{d}{2}\log\left(\frac{d}{2}\right)\right)=-\frac{r^{2}}{2}+\frac{d}{2}+\frac{d}{2}\log\left(\frac{r^{2}}{2}\right)-\frac{d}{2}\log\left(\frac{d}{2}\right).

Therefore, we have the bound

1-F_{d}(r^{2})\geq\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{r^{2}}{2}+\frac{d}{2}+\frac{d}{2}\log\left(\frac{r^{2}}{2}\right)-\frac{d}{2}\log\left(\frac{d}{2}\right)\right)\cdot\left(\frac{c^{*}}{\eta\sqrt{a}}-\frac{c^{**}}{\sqrt{d}}\right).

Consider further that the inequality $x-\log(1+x)\leq\frac{x^{2}}{2}\leq x^{2}$, valid for $x\geq 0$, gives us

\eta\sqrt{a}=\sqrt{\frac{d}{2}}\cdot\sqrt{2\left(\frac{\beta}{\sqrt{d}}-\log\left(1+\frac{\beta}{\sqrt{d}}\right)\right)}=\sqrt{d\left(\frac{\beta}{\sqrt{d}}-\log\left(1+\frac{\beta}{\sqrt{d}}\right)\right)}\leq\beta.

Hence, we have

1-F_{d}(r^{2})\geq\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{r^{2}}{2}+\frac{d}{2}+\frac{d}{2}\log\left(\frac{r^{2}}{2}\right)-\frac{d}{2}\log\left(\frac{d}{2}\right)\right)\left(\frac{c^{*}}{\beta}-\frac{c^{**}}{\sqrt{d}}\right).

Let us take $L^{*}:=\frac{c^{*}}{2c^{**}}$. With this choice, we have $\beta\leq\sqrt{d}\cdot\frac{c^{*}}{2c^{**}}$ and so $\frac{c^{*}}{\beta}-\frac{c^{**}}{\sqrt{d}}\geq\frac{c^{*}}{2\beta}$. Therefore,

\left(\frac{r^{2}}{2}\right)^{d/2}\frac{e^{-r^{2}/2}}{\Gamma\left(\frac{d}{2}+1\right)}\cdot\frac{1}{1-F_{d}(r^{2})}\leq\frac{1+c}{\sqrt{\pi d}}\cdot\frac{2\beta\sqrt{2\pi}}{c^{*}}\cdot\frac{\exp\left(\frac{d}{2}\log\left(\frac{r^{2}}{2}\right)-\frac{r^{2}}{2}-\frac{d}{2}\log\left(\frac{d}{2}\right)+\frac{d}{2}\right)}{\exp\left(-\frac{r^{2}}{2}+\frac{d}{2}+\frac{d}{2}\log\left(\frac{r^{2}}{2}\right)-\frac{d}{2}\log\left(\frac{d}{2}\right)\right)}\leq C^{*}\frac{\beta}{\sqrt{d}}

for a universal positive constant $C^{*}$. Thus we have

\alpha_{t}(d)\leq d\left(1+\frac{C^{*}\beta}{\sqrt{d}}\right)=d+C^{*}\beta\sqrt{d}

as desired. ∎
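The closed form used in the first two lines of the proof, $\alpha_{t}(d)=d\,\frac{1-F_{d+2}(r^{2})}{1-F_{d}(r^{2})}$, also permits a quick numerical illustration of the lemma (a sketch assuming numpy and scipy; the grids and the reported ratio are illustrative choices, not a proof of any particular $C^{*}$):

import numpy as np
from scipy.stats import chi2

# alpha_t(d) = d * P(chi2_{d+2} >= r^2) / P(chi2_d >= r^2), r^2 = d + beta * sqrt(d).
for d in [100, 1000]:
    beta = np.linspace(1, np.sqrt(d) / 4, 50)
    r2 = d + beta * np.sqrt(d)
    alpha = d * chi2.sf(r2, df=d + 2) / chi2.sf(r2, df=d)
    print(d, ((alpha - d) / (beta * np.sqrt(d))).max())  # stays bounded (the C* of the lemma)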

Theorem 8 (Uniform expansion of the incomplete gamma function [43]).

For real numbers $a>0$ and $x\geq 0$, define the upper incomplete gamma function

Q(a,x):=\frac{1}{\Gamma(a)}\int_{x}^{\infty}t^{a-1}e^{-t}\,dt.

Further define $\lambda=\frac{x}{a}$, $\mu=\lambda-1$, and $\eta=\sqrt{2(\mu-\log(1+\mu))}$. Then $Q$ admits an asymptotic series expansion in $a$ which is uniform in $\eta\in\mathbb{R}$. In other words, for any integer $N\geq 0$, we have

Q(a,x)=\left(1-\Phi\left(\eta\sqrt{a}\right)\right)+\frac{e^{-\frac{a\eta^{2}}{2}}}{\sqrt{2\pi a}}\sum_{k=0}^{N}c_{k}(\eta)a^{-k}+\operatorname{Rem}_{N}(a,\eta)

where the remainder term satisfies

\lim_{a\to\infty}\sup_{\eta\in\mathbb{R}}\left|\frac{\operatorname{Rem}_{N}(a,\eta)}{\frac{e^{-\frac{a\eta^{2}}{2}}}{\sqrt{2\pi a}}c_{N}(\eta)a^{-N}}\right|=0.

Here, $\Phi$ denotes the cumulative distribution function of the standard normal distribution.

Theorem 9 (Theorem 1 in [42]).

Consider the setting of Theorem 8. The coefficient $c_{0}$ is given by $c_{0}(\eta)=\frac{1}{\mu}-\frac{1}{\eta}$.

Corollary 2.

Consider the setting of Theorem 8. For any $a>0$ and $x\geq 0$, we have

Q(a,x)=\left(1-\Phi\left(\eta\sqrt{a}\right)\right)+\frac{e^{-\frac{a\eta^{2}}{2}}}{\sqrt{2\pi a}}\left(\frac{1}{\mu}-\frac{1}{\eta}\right)+\operatorname{Rem}_{0}(a,\eta)

where

\lim_{a\to\infty}\sup_{\eta\in\mathbb{R}}\left|\frac{\operatorname{Rem}_{0}(a,\eta)}{\frac{e^{-\frac{a\eta^{2}}{2}}}{\sqrt{2\pi a}}\left(\frac{1}{\mu}-\frac{1}{\eta}\right)}\right|=0.

Consequently, if $a$ is larger than some universal positive constant and $\mu>0$, we have

\left|Q(a,x)-\left(1-\Phi\left(\eta\sqrt{a}\right)\right)\right|\leq\frac{Ce^{-\frac{a\eta^{2}}{2}}}{\sqrt{2\pi a}}

where $C$ is some universal positive constant.

Proof.

The first two displays follow exactly from Theorems 8 and 9. To show the final display, we must show that $\left|\frac{1}{\mu}-\frac{1}{\eta}\right|\lesssim 1$ whenever $\mu>0$. First, consider the Taylor expansion

\log(1+\mu)=\mu-\frac{\mu^{2}}{2(1+\xi)^{2}}

where $\xi$ is some point between $0$ and $\mu$. Therefore,

\eta=\sqrt{2(\mu-\log(1+\mu))}=\frac{\mu}{1+\xi}.

Thus

\left|\frac{1}{\mu}-\frac{1}{\eta}\right|=\left|\frac{1}{\mu}-\frac{1+\xi}{\mu}\right|=\frac{\xi}{\mu}\leq 1

since $\xi$ is between $0$ and $\mu$. Therefore, the final display in the statement of the corollary follows by taking $a$ larger than some universal constant and taking $C$ to be a large enough universal constant. ∎
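The corollary can also be checked numerically (a sketch assuming numpy and scipy; the value $a=100$ and the grid of $\mu$ values are illustrative choices):

import numpy as np
from scipy.stats import norm
from scipy.special import gammaincc

a = 100.0
mu = np.linspace(0.01, 3, 100)               # mu = x/a - 1 > 0
x = a * (1 + mu)
eta = np.sqrt(2 * (mu - np.log1p(mu)))       # eta as defined in Theorem 8
gap = np.abs(gammaincc(a, x) - norm.sf(eta * np.sqrt(a)))
envelope = np.exp(-a * eta**2 / 2) / np.sqrt(2 * np.pi * a)
print((gap / envelope).max())                # bounded by a modest constant, as claimed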

Lemma 22.

Let $Z\sim N(0,I_{d})$. If $t\geq 0$, then

E(\|Z\|^{2}\,|\,\|Z\|^{2}\geq d+t^{2})\leq d+C^{*}(t^{2}\vee d)

where $C^{*}$ is a positive universal constant.

Proof.

We will choose a universal constant $L\geq 1$ at the end of the proof, so for now we leave it undetermined. Let $f_{d}$ denote the probability density function of the $\chi^{2}_{d}$ distribution. Observe that

\begin{aligned}
E(\|Z\|^{2}\,|\,\|Z\|^{2}\geq d+t^{2})&=E(\|Z\|^{2}\mathbbm{1}_{\{\|Z\|^{2}\leq d+Lt^{2}\}}\,|\,\|Z\|^{2}\geq d+t^{2})+E(\|Z\|^{2}\mathbbm{1}_{\{\|Z\|^{2}>d+Lt^{2}\}}\,|\,\|Z\|^{2}\geq d+t^{2})\\
&\leq d+Lt^{2}+\sqrt{E(\|Z\|^{4}\,|\,\|Z\|^{2}\geq d+t^{2})}\sqrt{P\{\|Z\|^{2}>d+Lt^{2}\}}.
\end{aligned}

By Corollary 1, there exists a universal constant $c_{1}$ such that

\sqrt{P\{\|Z\|^{2}>d+Lt^{2}\}}\leq\sqrt{2}\exp\left(-Lc_{1}\min\left(\frac{t^{4}}{d},t^{2}\right)\right).

Here we have used $L\geq 1$. Further consider that

\begin{aligned}
E(\|Z\|^{4}\,|\,\|Z\|^{2}\geq d+t^{2})&=\frac{\int_{d+t^{2}}^{\infty}z^{2}f_{d}(z)\,dz}{P\{\|Z\|^{2}\geq d+t^{2}\}}\\
&\leq\frac{E(\|Z\|^{4})}{P\{\|Z\|^{2}\geq d+t^{2}\}}\\
&=\frac{d(d+2)}{P\{\|Z\|^{2}\geq d+t^{2}\}}.
\end{aligned}

Therefore, by Lemma 17 we have

E(\|Z\|^{2}\,|\,\|Z\|^{2}\geq d+t^{2})\leq d+Lt^{2}+dC\cdot\frac{\exp\left(-Lc_{1}\min\left(\frac{t^{4}}{d},t^{2}\right)\right)}{\exp\left(-c_{2}\min\left(\frac{t^{4}}{d},t^{2}\right)\right)}

where $C$ and $c_{2}$ are universal positive constants. Taking $L:=\frac{c_{2}}{c_{1}}\vee 1$ completes the proof. ∎
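Since $zf_{d}(z)=df_{d+2}(z)$, the conditional mean in Lemma 22 has the closed form $d\,\frac{1-F_{d+2}(d+t^{2})}{1-F_{d}(d+t^{2})}$, which gives a quick numerical illustration (a sketch assuming numpy and scipy; the grids are illustrative choices):

import numpy as np
from scipy.stats import chi2

for d in [10, 100]:
    t2 = np.linspace(0, 5 * d, 100)
    cond_mean = d * chi2.sf(d + t2, df=d + 2) / chi2.sf(d + t2, df=d)
    print(d, ((cond_mean - d) / np.maximum(t2, d)).max())  # bounded, as the lemma asserts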

Lemma 23.

Let $Z\sim N(0,I_{d})$. If $t\geq 0$, then

E(\|Z\|^{4}\,|\,\|Z\|^{2}\geq d+t^{2})\leq d^{2}+C^{*}(t^{4}\vee d^{2})

where $C^{*}$ is a positive universal constant.

Proof.

We will choose a universal constant $L\geq 1$ at the end of the proof, so for now we leave it undetermined. Let $f_{d}$ denote the probability density function of the $\chi^{2}_{d}$ distribution. Consider

\begin{aligned}
&E(\|Z\|^{4}\,|\,\|Z\|^{2}\geq d+t^{2})\\
&=E(\|Z\|^{4}\mathbbm{1}_{\{\|Z\|^{2}\leq d+Lt^{2}\}}\,|\,\|Z\|^{2}\geq d+t^{2})+E(\|Z\|^{4}\mathbbm{1}_{\{\|Z\|^{2}>d+Lt^{2}\}}\,|\,\|Z\|^{2}\geq d+t^{2})\\
&\leq d^{2}+2Lt^{2}d+L^{2}t^{4}+\sqrt{E(\|Z\|^{8}\,|\,\|Z\|^{2}\geq d+t^{2})}\sqrt{P\{\|Z\|^{2}>d+Lt^{2}\}}.
\end{aligned}

By Corollary 1, there exists a universal positive constant $c_{1}$ such that

\sqrt{P\{\|Z\|^{2}>d+Lt^{2}\}}\leq\sqrt{2}\exp\left(-Lc_{1}\min\left(\frac{t^{4}}{d},t^{2}\right)\right).

Here we have used $L\geq 1$. Further consider

\begin{aligned}
E(\|Z\|^{8}\,|\,\|Z\|^{2}\geq d+t^{2})&=\frac{\int_{d+t^{2}}^{\infty}z^{4}f_{d}(z)\,dz}{P\{\|Z\|^{2}\geq d+t^{2}\}}\\
&\leq\frac{E(\|Z\|^{8})}{P\{\|Z\|^{2}\geq d+t^{2}\}}\\
&=\frac{d(d+2)(d+4)(d+6)}{P\{\|Z\|^{2}\geq d+t^{2}\}}.
\end{aligned}

Hence, by Lemma 17 we have

E(\|Z\|^{4}\,|\,\|Z\|^{2}\geq d+t^{2})\leq d^{2}+2Lt^{2}d+L^{2}t^{4}+d^{2}C\frac{\exp\left(-Lc_{1}\min\left(\frac{t^{4}}{d},t^{2}\right)\right)}{\exp\left(-c_{2}\min\left(\frac{t^{4}}{d},t^{2}\right)\right)}

where $C>0$ is a universal constant. Taking $L>\frac{c_{2}}{c_{1}}\vee 1$ completes the proof. ∎

Lemma 24.

Let $Z\sim N(0,I_{d})$. If $t\geq 0$, then

\operatorname{Var}(\|Z\|^{2}\,|\,\|Z\|^{2}\geq d+t^{2})\lesssim t^{4}+de^{-C\min\left(\frac{t^{4}}{d},t^{2}\right)}

for some universal positive constant $C$.

Proof.

We will use a universal constant $L\geq 1$ in our proof; a choice for it will be made at the end. Let $f_{d}$ denote the probability density of the $\chi^{2}_{d}$ distribution. Observe that

\begin{aligned}
&\operatorname{Var}(\|Z\|^{2}\,|\,\|Z\|^{2}\geq d+t^{2})\\
&=\operatorname{Var}(\|Z\|^{2}-d\,|\,\|Z\|^{2}\geq d+t^{2})\\
&\leq E((\|Z\|^{2}-d)^{2}\,|\,\|Z\|^{2}\geq d+t^{2})\\
&=E((\|Z\|^{2}-d)^{2}\mathbbm{1}_{\{\|Z\|^{2}\leq d+Lt^{2}\}}\,|\,\|Z\|^{2}\geq d+t^{2})+E((\|Z\|^{2}-d)^{2}\mathbbm{1}_{\{\|Z\|^{2}>d+Lt^{2}\}}\,|\,\|Z\|^{2}\geq d+t^{2})\\
&\leq L^{2}t^{4}+\sqrt{E((\|Z\|^{2}-d)^{4}\,|\,\|Z\|^{2}\geq d+t^{2})}\sqrt{P\{\|Z\|^{2}\geq d+Lt^{2}\}}.
\end{aligned}

By Corollary 1, there exists a universal positive constant $c_{1}$ such that

\sqrt{P\{\|Z\|^{2}\geq d+Lt^{2}\}}\leq\sqrt{2}\exp\left(-c_{1}L\min\left(\frac{t^{4}}{d},t^{2}\right)\right).

Here, we have used $L\geq 1$. With this term under control, now observe

\begin{aligned}
E((\|Z\|^{2}-d)^{4}\,|\,\|Z\|^{2}\geq d+t^{2})&=\frac{\int_{d+t^{2}}^{\infty}(z-d)^{4}f_{d}(z)\,dz}{P\{\|Z\|^{2}\geq d+t^{2}\}}\\
&\leq\frac{E((\|Z\|^{2}-d)^{4})}{P\{\|Z\|^{2}\geq d+t^{2}\}}\\
&=\frac{12d(d+4)}{P\{\|Z\|^{2}\geq d+t^{2}\}}.
\end{aligned}

Hence, by Lemma 17 we have

\sqrt{E((\|Z\|^{2}-d)^{4}\,|\,\|Z\|^{2}\geq d+t^{2})}\leq\frac{C^{*}d}{\exp\left(-c_{2}\min\left(\frac{t^{4}}{d},t^{2}\right)\right)}.

Taking $L>\frac{c_{2}}{c_{1}}\vee 1$, we have

\begin{aligned}
\operatorname{Var}(\|Z\|^{2}\,|\,\|Z\|^{2}\geq d+t^{2})&\leq L^{2}t^{4}+\frac{C^{*}d}{\exp\left(-c_{2}\min\left(\frac{t^{4}}{d},t^{2}\right)\right)}\cdot\sqrt{2}\exp\left(-c_{1}L\min\left(\frac{t^{4}}{d},t^{2}\right)\right)\\
&\leq L^{2}t^{4}+d\cdot C^{*}\sqrt{2}\exp\left(-(c_{1}L-c_{2})\min\left(\frac{t^{4}}{d},t^{2}\right)\right),
\end{aligned}

which is precisely the desired result. ∎
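Using $zf_{d}(z)=df_{d+2}(z)$ and $z^{2}f_{d}(z)=d(d+2)f_{d+4}(z)$, the conditional variance can be evaluated in closed form, giving a quick numerical illustration of Lemma 24 (a sketch assuming numpy and scipy; the constant $0.1$ in the exponent and the grids are illustrative choices):

import numpy as np
from scipy.stats import chi2

for d in [10, 100]:
    t2 = np.linspace(0.0, 5 * d, 100)
    p0 = chi2.sf(d + t2, df=d)
    m1 = d * chi2.sf(d + t2, df=d + 2) / p0              # E(||Z||^2 | ||Z||^2 >= d + t^2)
    m2 = d * (d + 2) * chi2.sf(d + t2, df=d + 4) / p0    # E(||Z||^4 | ||Z||^2 >= d + t^2)
    var = m2 - m1**2
    bound = t2**2 + d * np.exp(-0.1 * np.minimum(t2**2 / d, t2))
    print(d, (var / bound).max())                        # stays bounded, as the lemma asserts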

Appendix C Miscellaneous

C.1 Minimax

Lemma 25.

We have

\Gamma_{\mathcal{H}}=\mu_{\nu_{\mathcal{H}}}\vee\frac{\sqrt{(\nu_{\mathcal{H}}-1)\log\left(1+\frac{p}{s^{2}}\right)}}{n}.

Consequently, $\Gamma_{\mathcal{H}}\leq\mu_{\nu_{\mathcal{H}}-1}$.

Proof.

Consider that

\begin{aligned}
\Gamma_{\mathcal{H}}&=\max_{\nu<\nu_{\mathcal{H}}}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{p}{s^{2}}\right)}}{n}\right\}\vee\max_{\nu\geq\nu_{\mathcal{H}}}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{p}{s^{2}}\right)}}{n}\right\}\\
&=\max_{\nu<\nu_{\mathcal{H}}}\left\{\frac{\sqrt{\nu\log\left(1+\frac{p}{s^{2}}\right)}}{n}\right\}\vee\max_{\nu\geq\nu_{\mathcal{H}}}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{p}{s^{2}}\right)}}{n}\right\}\\
&=\frac{\sqrt{(\nu_{\mathcal{H}}-1)\log\left(1+\frac{p}{s^{2}}\right)}}{n}\vee\mu_{\nu_{\mathcal{H}}}.
\end{aligned}

Here, we have used the ordering of the eigenvalues and the fact that $\mu_{\nu_{\mathcal{H}}}\leq\frac{\sqrt{\nu_{\mathcal{H}}\log\left(1+\frac{p}{s^{2}}\right)}}{n}$ to obtain the second term. The bound $\Gamma_{\mathcal{H}}\leq\mu_{\nu_{\mathcal{H}}-1}$ follows immediately from $\mu_{\nu_{\mathcal{H}}}\leq\mu_{\nu_{\mathcal{H}}-1}$ and $\frac{\sqrt{(\nu_{\mathcal{H}}-1)\log\left(1+\frac{p}{s^{2}}\right)}}{n}\leq\mu_{\nu_{\mathcal{H}}-1}$, which holds by definition of $\nu_{\mathcal{H}}$. ∎
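The identity is easy to confirm numerically (a sketch assuming numpy; the polynomially decaying eigenvalues $\mu_{\nu}=\nu^{-2}$ and the values of $n,p,s$ are illustrative choices, and $\nu_{\mathcal{H}}$ is computed as the first index at which $\mu_{\nu}$ crosses below $\sqrt{\nu\log(1+p/s^{2})}/n$, as in the proof):

import numpy as np

n, p, s = 100, 10**4, 10
L = np.log(1 + p / s**2)
nu = np.arange(1, 10**5 + 1)
mu = nu**(-2.0)                                    # mu_1 = 1 and decreasing
gamma = np.minimum(mu, np.sqrt(nu * L) / n).max()  # Gamma_H as a maximum over nu
nu_H = nu[mu <= np.sqrt(nu * L) / n].min()         # first crossing index
closed = max(mu[nu_H - 1], np.sqrt((nu_H - 1) * L) / n)
assert np.isclose(gamma, closed)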

Proof of Lemma 1.

The result is a special case of Lemma 2. ∎

C.2 Adaptation

Lemma 26.

Suppose $a>0$. We have

\Gamma_{\mathcal{H}}(s,a)=\mu_{\nu_{\mathcal{H}}(s,a)}\vee\frac{\sqrt{(\nu_{\mathcal{H}}(s,a)-1)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}.

Consequently, $\Gamma_{\mathcal{H}}(s,a)\leq\mu_{\nu_{\mathcal{H}}(s,a)-1}$.

Proof.

Consider that

\begin{aligned}
\Gamma_{\mathcal{H}}(s,a)&=\max_{\nu<\nu_{\mathcal{H}}(s,a)}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\right\}\vee\max_{\nu\geq\nu_{\mathcal{H}}(s,a)}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\right\}\\
&=\max_{\nu<\nu_{\mathcal{H}}(s,a)}\left\{\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\right\}\vee\max_{\nu\geq\nu_{\mathcal{H}}(s,a)}\left\{\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\right\}\\
&=\frac{\sqrt{(\nu_{\mathcal{H}}(s,a)-1)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\vee\mu_{\nu_{\mathcal{H}}(s,a)}.
\end{aligned}

Here, we have used the ordering of the eigenvalues and the fact that $\mu_{\nu_{\mathcal{H}}(s,a)}\leq\frac{\sqrt{\nu_{\mathcal{H}}(s,a)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}$ to obtain the second term. The bound $\Gamma_{\mathcal{H}}(s,a)\leq\mu_{\nu_{\mathcal{H}}(s,a)-1}$ follows immediately from $\mu_{\nu_{\mathcal{H}}(s,a)}\leq\mu_{\nu_{\mathcal{H}}(s,a)-1}$ and $\frac{\sqrt{(\nu_{\mathcal{H}}(s,a)-1)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\leq\mu_{\nu_{\mathcal{H}}(s,a)-1}$, which holds by definition of $\nu_{\mathcal{H}}(s,a)$. ∎

Proof of Lemma 2.

We first prove the second inequality. Since $\log\left(1+\frac{pa}{s^{2}}\right)\leq\frac{n}{2}$ and $\mu_{1}=1$, it immediately follows that $\nu_{\mathcal{H}}(s,a)\geq 2$. Therefore,

\begin{aligned}
\frac{\sqrt{\nu_{\mathcal{H}}(s,a)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}&\leq\sqrt{2}\,\frac{\sqrt{(\nu_{\mathcal{H}}(s,a)-1)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\\
&\leq\sqrt{2}\left(\mu_{\nu_{\mathcal{H}}(s,a)}\vee\frac{\sqrt{(\nu_{\mathcal{H}}(s,a)-1)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\right)\\
&\leq\sqrt{2}\,\Gamma_{\mathcal{H}}(s,a)
\end{aligned}

as desired.

Now we prove the first inequality. For any $\nu\geq\nu_{\mathcal{H}}(s,a)$, we have by the ordering of the eigenvalues

\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\leq\mu_{\nu}\leq\mu_{\nu_{\mathcal{H}}(s,a)}\leq\frac{\sqrt{\nu_{\mathcal{H}}(s,a)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}.

For any $\nu<\nu_{\mathcal{H}}(s,a)$, by definition of $\nu_{\mathcal{H}}(s,a)$ it follows that

\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}=\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\leq\frac{\sqrt{\nu_{\mathcal{H}}(s,a)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}.

Since for all $\nu$ we have shown

\mu_{\nu}\wedge\frac{\sqrt{\nu\log\left(1+\frac{pa}{s^{2}}\right)}}{n}\leq\frac{\sqrt{\nu_{\mathcal{H}}(s,a)\log\left(1+\frac{pa}{s^{2}}\right)}}{n},

taking the maximum over $\nu$ yields

\Gamma_{\mathcal{H}}(s,a)\leq\frac{\sqrt{\nu_{\mathcal{H}}(s,a)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}

as desired. ∎

Proof of Lemma 3.

For ease of notation, let us write $a=\mathscr{A}_{\mathcal{H}}$. By the assumption $\log\left(1+pa\right)\leq\frac{n}{2}$ and $\mu_{1}=1$, we have $\nu_{\mathcal{H}}(s,a)\geq 2$ for all $s$. By definition of $\nu_{\mathcal{H}}(s,a)$, we have

\mu_{\nu_{\mathcal{H}}(s,a)}\leq\frac{\sqrt{\nu_{\mathcal{H}}(s,a)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}

and

\mu_{\nu_{\mathcal{H}}(s,a)-1}>\frac{\sqrt{(\nu_{\mathcal{H}}(s,a)-1)\log\left(1+\frac{pa}{s^{2}}\right)}}{n}.

Define

\delta_{s}:=\inf\left\{\delta>0:\mu_{\nu_{\mathcal{H}}(s,a)-1}\leq\frac{\sqrt{(\nu_{\mathcal{H}}(s,a)-1)\log\left(1+\frac{p(a+\delta)}{s^{2}}\right)}}{n}\right\}.

It is clear that $\delta_{s}>0$ for each $s$ since the inequality in the penultimate display is strict. Observe that $\nu_{\mathcal{H}}(s,a+\delta)>\nu_{\mathcal{H}}(s,a)-1$ for all $\delta<\delta_{s}$ by definition of $\nu_{\mathcal{H}}$. Moreover, since $\nu_{\mathcal{H}}$ is decreasing in its second argument, it follows that $\nu_{\mathcal{H}}(s,a+\delta)=\nu_{\mathcal{H}}(s,a)$ for all $\delta<\delta_{s}$. Letting $\delta^{*}:=1\wedge\min_{s\in[p]}\delta_{s}$, it immediately follows that $\mathscr{V}_{a}=\mathscr{V}_{a+\delta}$ for all $\delta<\delta^{*}$. By definition of $\mathscr{A}_{\mathcal{H}}$, it follows that for any $\delta<\delta^{*}$

\mathscr{A}_{\mathcal{H}}=a\leq\log(e|\mathscr{V}_{a}|)=\log(e|\mathscr{V}_{a+\delta}|)<a+\delta\leq 2\mathscr{A}_{\mathcal{H}}

where we have used $\mathscr{A}_{\mathcal{H}}\geq 1$ and $\delta\leq 1$ to obtain the final inequality. The proof is complete. ∎

C.3 Technical

Proposition 5 (Ingster-Suslina method [24]).

Suppose $\Sigma\in\mathbb{R}^{d\times d}$ is a positive definite matrix and $\Theta\subset\mathbb{R}^{d}$ is a parameter space. If $\pi$ is a probability distribution supported on $\Theta$, then

\chi^{2}\left(\int N(\theta,\Sigma)\,\pi(d\theta)\,\Big\|\,N(0,\Sigma)\right)=E\left(\exp\left(\left\langle\theta,\Sigma^{-1}\widetilde{\theta}\right\rangle\right)\right)-1

where $\theta,\widetilde{\theta}\overset{iid}{\sim}\pi$ and $\chi^{2}(\cdot\|\cdot)$ denotes the $\chi^{2}$-divergence defined in Section 1.1.
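As a one-dimensional sanity check of the proposition (a sketch assuming numpy and scipy; the two-point prior $\pi=\frac{1}{2}(\delta_{\theta}+\delta_{-\theta})$ with $\theta=0.8$ and $\Sigma=1$ is an illustrative choice), both sides of the identity equal $\cosh(\theta^{2})-1$:

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

theta = 0.8

def mix(x):
    return 0.5 * norm.pdf(x - theta) + 0.5 * norm.pdf(x + theta)

# Left side: chi^2 divergence by numerical quadrature.
lhs = quad(lambda x: mix(x)**2 / norm.pdf(x), -12, 12)[0] - 1
# Right side: E exp(theta * theta_tilde) - 1 under the two-point prior.
rhs = np.cosh(theta**2) - 1
assert np.isclose(lhs, rhs)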

Lemma 27 ([45]).

If $P,Q$ are probability measures on a measurable space $(\mathcal{X},\mathcal{A})$ with $P\ll Q$, then $d_{TV}(P,Q)\leq\frac{1}{2}\sqrt{\chi^{2}(P\|Q)}$.
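This inequality is simple to sanity-check numerically (a sketch assuming numpy and scipy; the Gaussian location family, with its classical closed forms $d_{TV}=2\Phi(\theta/2)-1$ and $\chi^{2}=e^{\theta^{2}}-1$, is an illustrative choice):

import numpy as np
from scipy.stats import norm

# For P = N(theta, 1) and Q = N(0, 1).
theta = np.linspace(0.01, 3, 100)
tv = 2 * norm.cdf(theta / 2) - 1
chi_sq = np.exp(theta**2) - 1
assert np.all(tv <= 0.5 * np.sqrt(chi_sq))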

Lemma 28.

If $Y$ is distributed according to the hypergeometric distribution with probability mass function $P\{Y=k\}=\frac{\binom{s}{k}\binom{n-s}{t-k}}{\binom{n}{t}}$ for $0\leq k\leq s\wedge t$, then $E(\exp(\lambda Y))\leq\left(1-\frac{s}{n}+\frac{s}{n}e^{\lambda}\right)^{t}$ for $\lambda>0$.

Corollary 3 ([10]).

If $Y$ is distributed according to the hypergeometric distribution with probability mass function $P\{Y=k\}=\frac{\binom{s}{k}\binom{n-s}{s-k}}{\binom{n}{s}}$ for $0\leq k\leq s\leq n$, then $E(Y)=\frac{s^{2}}{n}$ and $E(\exp(\lambda Y))\leq\left(1-\frac{s}{n}+\frac{s}{n}e^{\lambda}\right)^{s}$ for $\lambda>0$.
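The moment-generating-function bound admits a direct numerical check (a sketch assuming numpy and scipy; the parameter values are illustrative). Note that scipy's hypergeom takes the population size, the number of successes, and the number of draws, matching $(n,s,t)$ in Lemma 28:

import numpy as np
from scipy.stats import hypergeom

n, s, t = 50, 8, 12
k = np.arange(0, min(s, t) + 1)
pmf = hypergeom.pmf(k, n, s, t)                        # P{Y = k}
for lam in [0.1, 0.5, 1.0, 2.0]:
    mgf = np.sum(np.exp(lam * k) * pmf)
    bound = (1 - s / n + (s / n) * np.exp(lam))**t
    assert mgf <= bound + 1e-12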