Minimax Signal Detection in Sparse Additive Models
Abstract
Sparse additive models are an attractive choice in circumstances calling for modelling flexibility in the face of high dimensionality. We study the signal detection problem and establish the minimax separation rate for the detection of a sparse additive signal. Our result is nonasymptotic and applicable to the general case where the univariate component functions belong to a generic reproducing kernel Hilbert space. Unlike the estimation theory, the minimax separation rate reveals a nontrivial interaction between sparsity and the choice of function space. We also investigate adaptation to sparsity and establish an adaptive testing rate for a generic function space; adaptation is possible in some spaces while others impose an unavoidable cost. Finally, adaptation to both sparsity and smoothness is studied in the setting of Sobolev space, and we correct some existing claims in the literature.
1 Introduction
In the interest of interpretability, computation, and circumventing the statistical curse of dimensionality plaguing high dimensional regression, structure is often assumed on the true regression function. Indeed, it might plausibly be argued that sparse linear regression is the distinguishing export of modern statistics. Despite its popularity, circumstances may call for more flexibility to capture nonlinear effects of the covariates. Striking a balance between flexibility and structure, Hastie and Tibshirani [20] proposed generalized additive models (GAMs) as a natural extension to the vaunted linear model. In a GAM, the regression function admits an additive decomposition of univariate (nonlinear) component functions. However, as in the linear model, the sample size must outpace the dimension for consistent estimation. Following modern statistical instinct, a sparse additive model is compelling [33, 38, 29, 50, 35, 39]. The regression function admits an additive decomposition of univariate functions for which only a small subset are nonzero; it is the combination of a GAM and sparsity.
To fix notation, consider the $d$-dimensional Gaussian white noise model

$$dY(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dB(t) \qquad (1)$$

for $t \in [0,1]^d$. Though it may be more faithful to practical data analysis to consider the nonparametric regression model (e.g. as in [38]), the Gaussian white noise model is convenient as it avoids distracting technicalities while maintaining focus on the statistical essence. Indeed, the nonparametric statistics literature has a long history of studying the white noise model to understand theoretical limits, relying on well-known asymptotic equivalences [5, 40] which imply, under various conditions, that mathematical results obtained in one model can be ported over to the other model. As our focus is theoretical rather than practical, we follow in this tradition. The generalized additive model asserts the regression function is of the form $f(t) = \sum_{j=1}^{d} f_j(t_j)$ with the $f_j$ being univariate functions belonging to some function space. Likewise, the sparse additive model asserts

$$f(t) = \sum_{j \in S} f_j(t_j)$$

for some unknown support set $S \subseteq \{1, \ldots, d\}$ of size $|S| = s$ denoting the active covariates.
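To make the setting concrete, the short simulation below sketches a sparse additive signal: only $s$ of the $d$ covariates carry a nonzero univariate component. The cosine basis, the polynomial coefficient decay, the noise level, and all numerical values are illustrative stand-ins chosen for this sketch (they are not the paper's construction); the noisy observations mimic the nonparametric regression analogue of the white noise model.

```python
import numpy as np

rng = np.random.default_rng(0)

d, s, n_basis = 50, 3, 25          # ambient dimension, sparsity, basis truncation (illustrative)
support = rng.choice(d, size=s, replace=False)

# Each active component is a random smooth univariate function built from a
# cosine basis with polynomially decaying coefficients -- a stand-in for a
# function in (the unit ball of) a Sobolev-type RKHS.
def random_component(alpha=1.0):
    k = np.arange(1, n_basis + 1)
    coefs = rng.normal(size=n_basis) * k ** (-(alpha + 0.5))
    return lambda t: np.sqrt(2) * np.cos(np.pi * np.outer(t, k)) @ coefs

components = {j: random_component() for j in support}

def f(T):
    """Evaluate the sparse additive signal at the rows of T (shape (m, d))."""
    return sum(comp(T[:, j]) for j, comp in components.items())

# Noisy observations on a random design, the regression analogue of (1) with
# noise level corresponding to sample size n.
n = 500
T = rng.uniform(size=(n, d))
y = f(T) + rng.normal(size=n)
print("active covariates:", np.sort(support), " signal sd:", round(float(f(T).std()), 3))
```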
Most of the existing literature has addressed estimation of sparse additive models, primarily in the nonparametric regression setting and with being a reproducing kernel Hilbert space. After a series of works [33, 29, 35, 39], Raskutti et al. [38] (see also [29]) established that a penalized -estimator achieves the minimax estimation rate under various choices of the reproducing kernel Hilbert space . Yuan and Zhou [50] establish minimax estimation rates under a notion of approximate sparsity. As is now typical of estimation theory, the powerful framework of empirical processes is brought to bear in their proofs. Some articles have also addressed generalizations of the sparse additive model. For example, the authors of [49] consider, among other structures, an additive signal where each component function is actually a multivariate function depending on at most many coordinates and is -Hölder. The authors go on to derive minimax rates that handle heterogeneity in the smoothness indices and the sparsities of the coordinate functions; as a taste of their results, they show, in a particular regime and under some conditions, the rate in the special, homogeneous case where and for all . Recently, the results of [3] show certain deep neural networks can achieve the minimax estimation rate for sparse -way interaction models. The -way interaction model is also known as nonparametric ANOVA. To give an example, the sparse -way interaction model assumes where the sets of active variables and interactions have small cardinalities. When the are -Hölder and the are -Hölder, [3] establishes, under some conditions and up to factors logarithmic in , the rate .
The literature has much less to say on the problem of signal detection
(2) $H_0 \colon f = 0$
(3) $H_1 \colon f$ belongs to the class of sparse additive signals given by (4)
Adopting a minimax perspective [6, 23, 22, 21, 24], the goal is to determine the smallest separation rate as a function of the sparsity level $s$, the dimension $d$, the sample size $n$, and the function space such that consistent detection of the alternative against the null is possible.
Though to a much lesser extent than the corresponding estimation theory, optimal testing rates have been established in various high dimensional settings other than sparse additive models. The most canonical setup, the Gaussian sequence model, is perhaps the most studied [24, 2, 41, 34, 30, 8, 14, 10, 7, 12, 18]. Optimal rates have also been established in linear regression [1, 26, 37] and other models [36, 13]. A common motif is that optimal testing rates exhibit different phenomena from optimal estimation rates.
Returning to (2)-(3), the only directly relevant works in the literature are Ingster and Lepski’s article [25] and a later article by Gayraud and Ingster [17]. Ingster and Lepski [25] consider a sparse multichannel model which, after a transformation to sequence space, is closely related to (2)-(3). Adopting an asymptotic setting and exclusively choosing to be a Sobolev space, they establish asymptotic minimax separation rates. However, their results only address the regimes and for a constant . Their result does not precisely pin down the rate near the phase transition . In essence, their testing procedure in the sparse regime is to apply a Bonferroni correction to a collection of $\chi^2$-type tests, one test for each of the coordinates. Thus, a gap in their rate near is unsurprising. In the dense regime, a single $\chi^2$-type test is used, as is typical in the minimax testing literature. Ingster and Lepski [25] also address adaptation to sparsity as well as adaptation to both the sparsity and the smoothness. Gayraud and Ingster [17] consider sparse additive models rather than a sparse multichannel model but make the same choice of and work in an asymptotic setup. They establish sharp constants for the sparse case via a Higher Criticism type testing statistic. Throughout the paper, we primarily compare our results to [25] as it was the first article to establish rates. Our results do not address the question of sharp constants.
Our paper’s main contributions are the following. First, adopting a nonasymptotic minimax testing framework as initiated in [2], we establish the nonasymptotic minimax separation rate for (2)-(3) for any configuration of the sparsity , dimension , sample size , and a generic function space . Notably, we do not restrict ourselves to Sobolev (or Besov) space as in [25, 17]. The test procedure we analyze involves thresholded statistics, following a strategy employed in other sparse signal detection problems [10, 30, 8, 9, 34].
Our second contribution is to establish an adaptive testing rate for a generic function space. Typically, the sparsity level is unknown, and it is of practical interest to have a methodology which can accommodate a generic . Interestingly, some choices of do not involve any cost of adaptation, that is, the minimax rate can be achieved without knowing the sparsity. Our rate’s incurred adaptation cost turns out to be a delicate function of , thus extending Ingster and Lepski’s focus on Sobolev spaces [25]. Even in the Sobolev case, our result extends upon their article; near the regime , our test provides finer detail by incurring a cost involving only instead of as incurred by their test. In principle, our result can be used to reverse the process and find a space for which this adaptive rate incurs a given adaptation cost.
Finally, adaptation to both sparsity and smoothness is studied in the context of Sobolev space. We identify an issue with and correct a claim made by [25].
1.1 Notation
The following notation will be used throughout the paper. For a positive integer $m$, let $[m] = \{1, \ldots, m\}$. For $a, b \in \mathbb{R}$, denote $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. Denote $a \lesssim b$ to mean there exists a universal constant $C > 0$ such that $a \le C b$. The expression $a \gtrsim b$ means $b \lesssim a$. Further, $a \asymp b$ means $a \lesssim b$ and $b \lesssim a$. The symbol denotes the usual inner product in Euclidean space and denotes the Frobenius inner product. The total variation distance between two probability measures $P$ and $Q$ on a measurable space is defined as $d_{TV}(P, Q) = \sup_{A} |P(A) - Q(A)|$, where the supremum runs over measurable sets. If $P$ is absolutely continuous with respect to $Q$, the $\chi^2$-divergence is defined as $\chi^2(P \,\|\, Q) = \int \left(\frac{dP}{dQ}\right)^2 \, dQ - 1$. For a finite set $S$, the notation $|S|$ denotes the cardinality of $S$. Throughout, iterated logarithms will be used (e.g. expressions like and ). Without explicitly stating so, we will take such an expression to be equal to some universal constant if otherwise it would be less than one. For example, should be understood to be equal to a universal constant when .
1.2 Setup
1.2.1 Reproducing Kernel Hilbert Space (RKHS)
Following the literature (e.g. [47]), $\mathcal{H}$ will denote a reproducing kernel Hilbert space (RKHS). Before discussing the main results, basic properties of RKHSs are reviewed [47]. Suppose $\mathcal{H} \subset L^2[0,1]$ is an RKHS with associated inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. There exists a symmetric function $K \colon [0,1] \times [0,1] \to \mathbb{R}$ called a kernel such that for any $x \in [0,1]$ we have (1) the function $K(\cdot, x) \in \mathcal{H}$ and (2) for all $f \in \mathcal{H}$ we have $f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}}$. Mercer’s theorem (Theorem 12.20 in [47]) guarantees that the kernel admits an expansion in terms of eigenfunctions $\{\phi_k\}_{k=1}^{\infty}$ and eigenvalues $\{\mu_k\}_{k=1}^{\infty}$, namely $K(x, y) = \sum_{k=1}^{\infty} \mu_k \phi_k(x) \phi_k(y)$. To give examples, the kernel $K(x, y) = \min(x, y)$ defines the first-order Sobolev space with eigenvalue decay $\mu_k \asymp k^{-2}$, and the Gaussian kernel exhibits near-exponential eigenvalue decay (see [47] for a more detailed review).
Without loss of generality, we order the eigenvalues $\mu_1 \ge \mu_2 \ge \cdots$. The eigenfunctions $\{\phi_k\}$ are orthonormal in $L^2[0,1]$ under the usual inner product, and the $\mathcal{H}$ inner product satisfies $\langle \phi_j, \phi_k \rangle_{\mathcal{H}} = \delta_{jk}/\mu_j$ for $j, k \ge 1$.
1.2.2 Parameter space
The parameter space which will be used throughout the paper is defined in this section. Suppose is an RKHS. Recall we are interested in sparse additive signals for some sparsity pattern . Following [17, 38], for convenience we assume for all . This assumption is mild and can be relaxed; further discussion can be found in Section 5.1. Letting denote the constant function equal to one, consider that is a closed subspace of . Hence, is also an RKHS. We will put aside (along with its eigenfunctions and eigenvalues) and only work with . Let and denote its associated eigenfunctions and eigenvalues respectively. Following [2], we assume also for convenience; this can easily be relaxed.
For each subset , define
The condition on the RKHS norm enforces regularity. Define the parameter space
(4) |
for each .
Following [25, 17], it is convenient to transform (1) from function space to sequence space. The tensor product admits the orthonormal basis with for . For ease of notation, denote for and . Define the random variables
(5) |
where . The assumption for all is used here. Note by orthogonality that is a collection of independent random variables. The notation will frequently be used to denote the full matrix of coefficients. The notation will also be used to denote the th column of . For , the corresponding set of coefficients is
(6) |
Note the parameter spaces and are in correspondence, and we will frequently write and freely in the same context without comment. The understanding is that the former is a function and the latter is its corresponding array of basis coefficients. The notation , and will be used to denote expectations and probability measures with respect to the denoted parameters.
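The sequence-space formulation just described can be sketched directly: the coefficients form a matrix whose columns correspond to covariates, only $s$ columns are nonzero, and every entry is observed with independent Gaussian noise of variance $1/n$. The eigenvalue sequence, the normalization of the columns, and the variable names below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

d, s, D, n = 200, 5, 40, 1000        # dimension, sparsity, basis truncation, sample size (illustrative)
mu = np.arange(1, D + 1) ** (-2.0)   # hypothetical eigenvalue sequence with Sobolev-like decay

# Coefficient matrix: column j holds the basis coefficients of the j-th
# component function; only s columns are nonzero.
theta = np.zeros((D, d))
support = rng.choice(d, size=s, replace=False)
for j in support:
    raw = rng.normal(size=D) * np.sqrt(mu)
    theta[:, j] = raw / np.sqrt(np.sum(raw ** 2 / mu))   # normalize the (hypothetical) RKHS norm to one

# Sequence-space observations: independent Gaussians with variance 1/n around theta.
x = theta + rng.normal(size=(D, d)) / np.sqrt(n)
print("nonzero columns:", np.sort(support))
```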
1.2.3 Problem
As described earlier, given an observation from (1) the signal detection problem (2)-(3) is of interest. The goal is to characterize the nonasymptotic minimax separation rate .
Definition 1.
The nonasymptotic minimax testing framework was initiated by Baraud [2] and characterizes (up to universal factors) the fundamental statistical limit of the detection problem. The framework is nonasymptotic in the sense that the conditions for the minimax separation rate hold for any configuration of the problem parameters.
Since , the problem can be equivalently formulated in sequence space as
(7) $H_0 \colon \theta = 0$
(8) $H_1 \colon \theta$ belongs to the corresponding class of coefficient arrays given by (6)
This testing problem, with the parameter space (6), is interesting in its own right outside the sparse additive model and RKHS context. Indeed, the testing problem (7)-(8) is essentially a group-sparse extension of the detection problem over ellipsoids considered in Baraud’s seminal article [2]. In fact, this interpretation was our initial motivation to study the detection problem. The connection to sparse additive models was a later consideration, similar to the way in which the later article [17] considers sparse additive models when building upon the earlier, fundamental work [25] dealing with a sparse multichannel (essentially group-sparse) model. Taking the perspective of a sequence problem has a long history in nonparametric regression [24, 25, 2, 15, 27, 48, 40, 5] due to not only its fundamental connections but also its advantage in distilling the problem to its essence and dispelling technical distractions. Our results can be exclusively read (and readers are encouraged to do so) in the context of the sequence problem (7)-(8).
2 Minimax rates
We begin by describing some high-level and natural intuition before informally stating our main result in Section 2.1. Section 2.2 contains the development of some key quantities. In Sections 2.3 and 2.4, we formally state minimax lower and upper bounds respectively. In Section 2.5, some special cases illustrating the general result are discussed.
2.1 A naive ansatz
A first instinct is to look to previous results for context in an attempt to make a conjecture regarding the optimal testing rate. To illustrate how this line of thinking might proceed, consider the classical Gaussian white noise model on the unit interval in one dimension,
for . Assume lives inside the unit ball of a reproducing kernel Hilbert space and thus admits a decomposition in the associated orthonormal basis with a coefficient vector living in an ellipsoid determined by the kernel’s eigenvalues . The optimal rate of estimation is given by Birgé and Massart [4]
Baraud [2] established that the minimax separation rate for the signal detection problem
is given by
In both estimation and testing, the maximizer can be conceptualized as the correct truncation order. Specifically for testing, Baraud’s procedure [2], working in sequence space, rejects when . The data for are not used, and it is in this sense that it is understood as a truncation level. To illustrate these results, the rates for Sobolev spaces with smoothness are and since . By now, these nonasymptotic results of [4, 2] are well known and canonical.
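For reference, here is a minimal sketch of a Baraud-style truncated chi-square test in the univariate sequence model. The truncation level, the chi-square quantile used as a threshold, and the simulation settings are illustrative choices rather than the calibrated quantities from [2].

```python
import numpy as np
from scipy.stats import chi2

def truncated_chi2_test(x, n, D, level=0.05):
    """Reject when the sum of the first D squared (rescaled) coefficients,
    which is chi-square with D degrees of freedom under the null, is large.
    Coefficients beyond the truncation level D are simply not used."""
    stat = n * np.sum(x[:D] ** 2)
    return stat > chi2.ppf(1 - level, df=D)

rng = np.random.default_rng(2)
n, D, K = 500, 20, 200
null_rej = np.mean([truncated_chi2_test(rng.normal(size=K) / np.sqrt(n), n, D)
                    for _ in range(200)])
theta = np.zeros(K); theta[:D] = 0.05            # a weak signal spread over the first D coefficients
alt_rej = np.mean([truncated_chi2_test(theta + rng.normal(size=K) / np.sqrt(n), n, D)
                   for _ in range(200)])
print("null rejection rate:", null_rej, " alternative rejection rate:", alt_rej)
```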
Moving to the setting of sparse additive signals in the model (1), Raskutti et al. [38] derive the nonasymptotic minimax rate of estimation
While their upper bound holds for any choice of (satisfying some mild conditions) and sparsity level , they only obtain a matching lower bound when and when the unit ball of has logarithmically or polynomially scaling metric entropy. This rate obtained by Raskutti et al. [38] is pleasing. It is quite intuitive to see the term due to the sparsity in the parameter space. The term is the natural rate for estimating many univariate functions in as if the sparsity pattern were known. Notably, there is no interaction between the choice of and the sparsity structure. The sparsity term is independent of and the estimation term is dimension free.
One might intuit that this lack of interaction is a general phenomenon. Instinct may suggest that signal detection in sparse additive models is also just many instances of a univariate nonparametric detection problem plus the problem of detecting a signal which is nonzero on an unknown sparsity pattern. Framing it as two distinct problems, one might conjecture the optimal testing rate should be plus the -sparse detection rate. Collier et al. [10] provide a natural candidate for the -sparse testing rate, namely the rate for and for .
However, this is not the case as a quick glance at [25] falsifies the conjecture for the case of Sobolev . Though quick to dispel hopeful thinking, [25] expresses little about the interaction between sparsity and . Our result explicitly captures this nontrivial interaction for a generic . We show the minimax separation rate is given by
(9) |
The rate bears some resemblance to the sparse testing rate of Collier et al. [10] and the nonparametric testing rate of Baraud [2], but the combination of the two is not a priori straightforward. At this point in discussion, not enough has been developed to explain how the form of the rate arises. Various features of the rate will be commented on later on in the paper.
2.2 Preliminaries
In this section, some key pieces are defined.
Definition 2.
Define to be the quantity
Note, since , it follows . It is readily seen from (9) there are two broad regimes to consider. When , we have . In the regime , the rate is more complicated. The first regime is really a trivial regime since any signal must satisfy by virtue of . Therefore, the degenerate test which always accepts vacuously detects sparse additive signals with . Hence, the upper bound is trivially achieved. It turns out a matching lower bound can be proved which establishes that the regime is fundamentally trivial; see Section 2.3 for a formal statement.
More generally, the form (9) is useful when discussing lower bounds. A different form is more convenient when discussing upper bounds, and it is a form which is familiar in the context of [10, 11].
Definition 3.
Define to be the smallest positive integer such that
As the next lemma shows, is essentially the solution to the maximization problem in the definition of . Drawing an analogy to the result of Baraud [2] described in Section 2.1, can be conceptualized as the correct order of truncation accounting for the dimension and sparsity.
Lemma 1.
If , then .
With Lemma 1, the testing rate can be expressed as
(10) |
The condition in Lemma 1 is natural in light of the triviality which occurs when .
2.3 Lower bound
In this section, a formal statement of a lower bound on the minimax separation rate is given. Define
(11) |
First, it is shown that the testing problem is trivial if .
Proposition 1 (Triviality).
Suppose . Suppose and . If , then for any we have
Proposition 1 asserts that in order to achieve small testing error, a necessary condition is that for a sufficiently large . To see why Proposition 1 is a statement about triviality, observe that . To reiterate plainly, all potential in the model (1) live inside the ball of radius . There are essentially no functions that have detectable norm when , and so the problem is essentially trivial.
The lower bound construction is straightforward. Working in sequence space, consider an alternative hypothesis with a prior in which a draw is obtained by drawing a size subset uniformly at random and setting if and or setting otherwise. The value of determines the separation between the null and alternative hypotheses since . However, must respect some constraints to ensure is supported on the parameter space and that it is impossible to distinguish the null and alternative hypotheses with vanishing error. Observe this construction places us in a one-dimensional sparse sequence model (since for ), which is precisely the setting of [10]. From their results, it is seen we must have . To ensure is supported on the parameter space, we must have for all . Since , it follows , and so the constraint must be enforced. When , only the second condition is binding, and so the largest separation we can achieve is . Hence, the problem is trivial in this regime.
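For concreteness, the sketch below draws once from a prior of the kind just described: a uniformly random support of size $s$ is chosen and the active columns receive a common coefficient magnitude with random signs on the first few basis coefficients. The spike magnitude, the Rademacher signs, and the truncation level are simplifying assumptions for illustration, not the exact prior used in the proofs.

```python
import numpy as np

def draw_from_prior(d, s, D, rho, rng):
    """One draw from a simplified version of the lower-bound prior: pick a
    uniformly random support of size s and fill the first D basis coefficients
    of each active column with +/- rho (random signs)."""
    theta = np.zeros((D, d))
    support = rng.choice(d, size=s, replace=False)
    theta[:, support] = rho * rng.choice([-1.0, 1.0], size=(D, s))
    return theta, support

rng = np.random.default_rng(3)
theta, support = draw_from_prior(d=100, s=4, D=10, rho=0.05, rng=rng)
# Every draw has the same squared norm s * D * rho^2, which sets the separation.
print("support:", np.sort(support), " squared norm:", round(float(np.sum(theta ** 2)), 4))
```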
To ensure non-triviality, it will be assumed . The choice of the factor is only for convenience and is not essential. In fact, the condition would always suffice for our purposes, and the condition would also suffice for . The following theorem establishes .
Theorem 1.
Suppose and . If , then there exists depending only on such that for any we have
where is given by (11).
The lower bound is proved via Le Cam’s two point method (also known as the method of “fuzzy” hypotheses [45]) which is standard in the minimax hypothesis testing literature [24]. The prior distribution employed in the argument is a natural construction. Namely, the active set (i.e. the size set of nonzero coordinate functions) is selected uniformly at random and the corresponding nonzero coordinate functions are drawn from the usual univariate nonparametric prior employed in the literature [2, 24].
2.4 Upper bound
In this section, testing procedures are constructed and formal statements establishing rate-optimality are made. The form of the rate in (10) is a more convenient target, and should be kept in mind when reading the statements of our upper bounds.
2.4.1 Hard thresholding in the sparse regime
In this section, the sparse regime is discussed. For any and , define where the data is defined via transformation to sequence space (5). For any , define
(12) |
where
(13) |
where . Note is a conditional expectation under the null hypothesis. The random variable will be used as a test statistic. Such a statistic was first defined by Collier et al. [10], and similar statistics have been successfully employed in other signal detection problems [9, 34, 30, 8]. However, all previous works in the literature have only used this statistic with . For our problem, it is necessary to take growing . Consequently, a more refined analysis is necessary, and essentially new phenomena appear in the upper bound as we discuss later. As noted in Section 2.2, the quantity can be conceptualized as the correct truncation order, that is, it turns out the correct choice is . The choice of is more complicated, and in fact there are two separate regimes depending on the size of . The regime in which is referred to as the “bulk” regime. The complementary regime is referred to as the “tail” regime.
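The following sketch shows the shape of such a thresholded statistic: a chi-square-type statistic is formed from the first few basis coefficients of each coordinate, hard-thresholded, and then aggregated over coordinates. The centering, the threshold, and all constants below are illustrative; in particular the conditional-expectation correction (13) used in the paper is omitted.

```python
import numpy as np

def thresholded_stat(x, n, D, t):
    """Hard-thresholded chi-square-type statistic (simplified sketch).

    x : (D_max, d) array of sequence-space observations.
    For each coordinate j, form Y_j = n * sum_{k <= D} x_{jk}^2, which is
    chi-square with D degrees of freedom under the null; keep its excess over D
    only when Y_j exceeds D + t, and sum the kept excesses over j."""
    Y = n * np.sum(x[:D, :] ** 2, axis=0)
    return np.sum(np.where(Y > D + t, Y - D, 0.0))

rng = np.random.default_rng(4)
d, D, n = 500, 15, 400
x_null = rng.normal(size=(D, d)) / np.sqrt(n)
x_alt = x_null.copy()
x_alt[:, :5] += 0.12                       # five active coordinates with a flat bump
t = 4 * np.sqrt(D * np.log(d))             # an illustrative "bulk"-type threshold
print("null statistic:", round(float(thresholded_stat(x_null, n, D, t)), 2),
      " alternative statistic:", round(float(thresholded_stat(x_alt, n, D, t)), 2))
```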
Proposition 2 (Bulk).
Proposition 3 (Tail).
Set where is the universal constant from Lemma 11. Let denote the universal positive constant from Proposition 2. There exist universal positive constants and such that the following holds. Suppose and . Set
If , then there exists depending only on such that for all , we have
where . Here, is given by (12).
Propositions 3 and 2 thus imply that, for , the minimax separation rate is upper bounded by . By Lemma 1, it follows that
under the condition that . As established by Proposition 1 in Section 2.3, a condition like this is essential to avoid triviality.
Propositions 3 and 2 reveal an interesting phase transition in the minimax separation rate at the point
This phase transition phenomenon is driven by the tail behavior of . Consider that under the null distribution for all , and so the statistic is the sum of independent and thresholded $\chi^2$ random variables. By a well-known lemma of Laurent and Massart [31] (also from Bernstein’s inequality up to constants), for a $\chi^2$ random variable with $D$ degrees of freedom and any $t > 0$ we have
$$\mathbb{P}\left( \chi^2_D \ge D + 2\sqrt{Dt} + 2t \right) \le e^{-t}. \qquad (14)$$
Roughly speaking, a $\chi^2_D$ random variable exhibits subgaussian-type deviation in the “bulk” ($t \lesssim D$) and subexponential-type deviation in the “tail” ($t \gtrsim D$). Consequently, should be intuited as a sum of thresholded subgaussian random variables when , and as a sum of thresholded subexponential random variables when .
Examining (14), in the “tail” it follows , which no longer exhibits dependence on . Analogously, in Proposition 3 the threshold is taken as and so the resulting rate exhibits no dependence on and consequently no dependence on . On the other hand, in the “bulk” it follows . Analogously, in Proposition 2 the threshold is taken as and so the resulting rate indeed depends on and thus on .
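A quick Monte Carlo check of the tail bound (14), stated above in the standard Laurent–Massart form, illustrates the two regimes: the deviation is governed by the $2\sqrt{Dt}$ term in the bulk and by the $2t$ term in the tail. The degrees of freedom and the values of $t$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
D, reps = 30, 200_000
chi2_samples = rng.chisquare(df=D, size=reps)

# Empirical exceedance probabilities versus the bound exp(-t) from (14).
for t in [1.0, 5.0, 30.0, 90.0]:           # spans the "bulk" (t <~ D) and the "tail" (t >~ D)
    thresh = D + 2 * np.sqrt(D * t) + 2 * t
    emp = np.mean(chi2_samples >= thresh)
    print(f"t = {t:5.1f}   empirical = {emp:.2e}   bound exp(-t) = {np.exp(-t):.2e}")
```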
2.4.2 $\chi^2$ tests in the dense regime
The situation is less complicated in the dense regime as it suffices to use the testing statistic .
Proposition 4.
Suppose . Set
If , then there exists depending only on such that for all , we have
2.5 Special cases
Having formally stated lower and upper bounds for the minimax separation rate, it is informative to explore a number of special cases and witness a variety of phenomena. Throughout the illustrations it will be assumed as discussed earlier.
2.5.1 Sobolev
Taking the case as emblematic of Sobolev space with smoothness , we obtain the minimax separation rate
It is useful to compare to the rates obtained by Ingster and Lepski (Theorem 2 in [25]) (see also [17]) although their choice of parameter space is slightly different from that considered in this paper. In the sparse case for a constant , their rate is
In the dense regime , their rate (Theorem 1 in [25]) is . Quick algebra verifies that our rate indeed matches Ingster and Lepski’s rate in these sparsity regimes.
In the sparse regime, the strange looking phase transition at in their rate now has a coherent explanation in view of our result. The situation corresponds to the “bulk” regime in which case from (12) behaves like a sum of thresholded subgaussian random variables. On the other side where , subexponential behavior is exhibited by . In fact, our result gives more detail beyond . Assume only (for example, is allowed now). Then the phase transition between the “bulk” and “tail” regimes actually occurs at .
2.5.2 Finite dimension
Consider a finite dimensional situation, that is for some positive integer . Function spaces exhibiting this kind of structure include linear functions, finite polynomials, and generally RKHSs based on kernels with finite rank. If , the minimax separation rate is
In the sparse regime, the phase transition between the bulk and tail regimes occurs at .
2.5.3 Exponential decay
As another example, consider exponential decay of the eigenvalues where are universal constants and . Such decay is a feature of RKHSs based on Gaussian kernels. The minimax separation rate is
In the sparse regime, the phase transition between the bulk and tail regimes occurs at . The minimax separation rate is quite close to the finite dimensional rate, which is sensible as RKHSs based on Gaussian kernels are known to be fairly “small” nonparametric function spaces [47].
3 Adaptation
Thus far the sparsity parameter has been assumed known, and the tests constructed make use of this information. In practice, the statistician is typically ignorant of the sparsity level and so it is of interest to understand whether adaptive tests can be furnished. In this section, we will establish an adaptive testing rate which accommodates a generic , and it turns out to exhibit a cost for adaptation which depends delicately on the function space.
To the best of our knowledge, Spokoiny’s article was the first to demonstrate an unavoidable cost for adaptation in a signal detection problem [41]. Later work established unavoidable costs for adaptive testing across a variety of situations (see [24] and references therein). This early work largely focused on adapting to an unknown smoothness parameter in a univariate nonparametric regression setting. More recently adaptation to sparsity in high dimensional models has been studied. In many problems, one can adapt to sparsity without cost in the rate (nor in the constant for some problems) [14, 34, 30, 26, 1, 18]. In the context of sparse additive models in Sobolev space, Ingster and Lepski [25] (see also [17]) consider adaptation, and we discuss their results in Section 3.4.
3.1 Preliminaries
To formally state our result, some slight generalizations of the concepts found in Section 2.2 are needed.
Definition 4.
For , define to be the smallest positive integer satisfying
Definition 5.
For , define to be
Note is decreasing in and is increasing in . As discussed in Section 2.2, two different forms are useful when discussing the separation rate, and Lemma 1 facilitated that discussion. The following lemma is a slight generalization and has the same purpose.
Lemma 2.
Suppose . If , then
At a high level, the central issue with adaptation lies in the selection of an estimator’s or a test’s hyperparameter (such as the bandwidth or penalty). Typically, there is some optimal choice but it requires knowledge of an unknown parameter (e.g. smoothness or sparsity) in order to pick it. In the current problem, the optimal choice of is unknown since the sparsity level is unknown.
In adaptive testing, the typical strategy is to fix a grid of different values, construct a test for each potential value of the hyperparameter, and for detection use the test which takes the maximum over this collection of tests. Typically, a geometric grid is chosen, and the logarithm of the grid’s cardinality reflects a cost paid by the testing procedure [41, 34, 25]. It turns out this high level intuition holds for signal detection in sparse additive models, but the details of how to select the grid are not direct since a generic must be accommodated. For , define
Define
(15) |
and define
(16) |
The grid will be used. It is readily seen that is finite as the crude bound is immediate. The following lemma shows that is, in essence, a fixed point.
Lemma 3.
If , then .
It turns out an adaptive testing procedure can be constructed which pays a price determined by (equivalently up to constant factors). To elaborate, assume . We will construct an adaptive test which achieves
The form of this separation rate is very similar to the minimax rate, except that the phase transition has shifted to and a cost involving is incurred. The value of can vary quite a bit with the choice of , and a few examples are illustrated in Section 3.4.
Our choice of the geometric grid yielding the fixed point characterization of Lemma 3 may not be immediately intuitive. It can be understood at a high level by noticing there are two competing factors at play. First, fix and note it can be conceptualized to represent a target cost from scanning over some grid. To achieve that target cost, we should use the truncation levels for each sparsity . Now, consider that we will scan over the geometric grid obtained from the truncation levels. If that geometric grid has cardinality which would incur a cost less than , then it would appear strange that we are scanning over a smaller grid to achieve a target cost associated to a larger grid. It is intuitive that we might try to aim for a lower target cost instead. But this changes our grid to which now has a different cardinality, and so the same logic applies to . A similar line of reasoning applies if the grid had happened to be too large. Therefore, we intuitively want to select such that has the proper cardinality to push the cost upon us. Hence, we are seeking a kind of fixed point in this sense.
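The fixed-point idea can be mimicked numerically. In the sketch below, `trunc_level` is a purely hypothetical surrogate for the truncation level (the only properties used are that it increases in the target cost and decreases in the sparsity, so it is not the paper's definition), and the iteration repeatedly replaces the target cost with the log-cardinality of the geometric grid it induces until the two agree, in the spirit of Lemma 3.

```python
import numpy as np

def trunc_level(s, u, n=1000, d=500, alpha=1.0):
    """Hypothetical surrogate for the truncation level at sparsity s and target
    cost u: increasing in u and (roughly) decreasing in s, which is all the
    fixed-point iteration below relies on."""
    val = (n * np.sqrt(u + s * np.log(np.e * d / s)) / s) ** (1.0 / (2 * alpha + 0.5))
    return max(1, int(np.ceil(val)))

def fixed_point_cost(n=1000, d=500, alpha=1.0, max_iter=100):
    """Iterate u <- log(cardinality of the induced geometric grid) until stable."""
    u = 1.0
    for _ in range(max_iter):
        levels = {trunc_level(s, u, n, d, alpha) for s in range(1, d + 1)}
        grid = {2 ** int(np.ceil(np.log2(D))) for D in levels}   # geometric grid over truncation levels
        u_new = max(1.0, np.log(len(grid)))
        if abs(u_new - u) < 1e-9:
            break
        u = u_new
    return u

print("fixed-point adaptation cost (illustrative):", round(fixed_point_cost(), 3))
```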
3.2 Lower bound
The formulation of a lower bound is more delicate than one might initially anticipate. To illustrate the subtlety, suppose we have a candidate adaptive rate . It is tempting to consider the following lower bound formulation; for any , there exists such that for any ,
While extremely natural, this criterion has an issue. In particular, if there exists such that where denotes the minimax rate, then the above criterion is trivially satisfied. One simply lower bounds the maximum over all with the choice and then invokes the minimax lower bound. To exaggerate, the candidate rate and for satisfies the above criterion. Furthermore, from an upper bound perspective there is a trivial test which achieves this candidate rate. The absurdity from this lower bound criterion would then force us to conclude this candidate rate gives an adaptive testing rate.
To avoid such absurdities, we use a different lower bound criterion. The key issue in the formulation is that, in the domain of maximization for the Type II error, one must not include any for which the candidate rate is of the same order as the minimax rate.
Definition 6.
Suppose is a candidate adaptive rate and are a collection of sets satisfying for all and . We say satisfies the adaptive lower bound criterion with respect to if the following two conditions hold.
-
(i)
For any , there exists depending only on such that for all ,
for all and .
-
(ii)
Either
(17) or
(18) where denotes the minimax separation rate.
Intuitively, this criterion is nontrivial only when there are no sparsity levels in the chosen reference sets for which the candidate rate is the same order as the minimax rate. More explicitly, there is no sequence with along which the candidate and minimax rates match (up to constants). This criterion avoids trivialities such as the one described earlier. In Definition 6, formalizing the notion of “order” requires speaking in terms of sequences, though it may appear unfamiliar and clunky at first glance. Definition 6 is not new, but rather a direct port to the testing context of the notion of an adaptive rate of convergence on the scale of classes from [44] in the estimation setting (see also [11] for an application of this notion to linear functional estimation in the sparse Gaussian sequence model).
The condition (18) simply requires the candidate rate to match the minimax rate if (17) does not hold. In this way, the definition allows for the possibility of the minimax rate to be an adaptive rate (in those cases where no cost for adaptation needs to be paid).
We now state a lower bound with respect to this criterion. Define the candidate rate
(19) |
Here, is defined in (15). Examining this candidate, it is possible one may run into absurdities for since , meaning the candidate rate can match the minimax rate. Note here we have used that grows at most logarithmically in . Therefore, we will take (dropping the subscripts and writing for notational ease).
It follows from the fact that is increasing in that either (17) or (18) should typically hold. To see that is indeed increasing in , let us fix . Consider that for every we have
Taking maximum over on both sides yields , i.e. is increasing in . Now, consider and in order to compare the candidate rate to the minimax rate . Since , we have and . With the above display, it is intuitive that either (17) or (18) should hold in typical nonpathological cases; these conditions can be easily checked on a problem by problem basis (such as those in Section 3.4).
In preparation for the statement of the lower bound, the following set is needed
(20) |
With defined, we are ready to state the lower bound. The following result establishes condition (i) of Definition 6 for given by (19) with respect to our choice .
Theorem 2.
Suppose . Assume further . If , then there exists depending only on such that for all , we have
where is given by (19) and .
The condition is somewhat mild; for example it is satisfied when the eigenvalues satisfy a polynomial decay such as for some and positive universal constant . It is also satisfied when there is exponential decay such as for some and positive universal constants and . Section 3.4 contains further discussion of these two cases.
The proof strategy is, at a high level, the same as that in Section 2.3 with the added complication that the prior should also randomize over the sparsity level in order to capture the added difficulty of not knowing the sparsity level. The random sparsity level is drawn by drawing a random truncation point and extracting the induced sparsity. The rest of the prior construction is similar to that in Section 2.3, but the analysis is complicated by the random sparsity level.
Finally, note Theorem 2 is a statement about a cost paid for adaptation for sparsity levels in . Nothing is asserted about smaller sparsities. It would be interesting to show whether or not the candidate also captures the cost for sparsity levels which satisfy for all (e.g. ); we leave it open for future work.
3.3 Upper bound
In this section, an adaptive test is constructed. Recall and given in (15) and (16) respectively. Define
(21) |
In the test, the following collection of geometric grids is used. Define
(22) |
The choice to take as a geometric grid is not statistically critical, but doing so is computationally advantageous.
Theorem 3.
There exist universal positive constants and such that the following holds. Let , , and be given by (15), (16), and (22) respectively. For and , set and set
If , then there exists depending only on such that for all , we have
where is given by (21). Here, is given by
and the statistic is given by (12).
As mentioned earlier, the adaptive test involves scanning over the potential values of in the geometric grid . Consequently, a cost involving is paid. Note for we also need to scan over in order to set the thresholds in the statistics properly. It turns out this extra scan does not incur a cost; cost is driven by needing to scan over .
To understand how scanning over can incur a cost, it is most intuitive to consider the need to control the Type I error when scanning over the statistics with various degrees of freedom. Roughly speaking, it is necessary to pick the threshold values such that the Type I error is smaller than some prescribed error level. By union bound and (14), consider the tail bound . To ensure the Type I error is small, this tail bound suggests the choice where the latter equivalence follows from the fact grows at most logarithmically in . Since the threshold is inflated, a signal must also have larger norm in order to be detected. The logarithmic inflation factor is a consequence of the tail.
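A schematic version of such a scan is sketched below: one chi-square-type statistic per truncation level in a geometric grid, each compared against a threshold inflated by the square root of the log-cardinality of the grid, as suggested by the union bound. For simplicity the sketch uses plain (unthresholded) aggregated statistics and illustrative constants rather than the calibrated statistics and thresholds of Theorem 3.

```python
import numpy as np

def adaptive_scan_test(x, n, grid, c=3.0):
    """Scan a geometric grid of truncation levels; reject if any level's
    aggregated centered chi-square statistic exceeds its inflated threshold."""
    d = x.shape[1]
    log_card = np.log(max(len(grid), 2))
    for D in grid:
        Y = n * np.sum(x[:D, :] ** 2, axis=0) - D        # centered per-coordinate statistics
        thresh = c * np.sqrt(d * D) * np.sqrt(log_card)  # sqrt(log|grid|) inflation from the union bound
        if np.sum(Y) > thresh:
            return True
    return False

rng = np.random.default_rng(6)
d, Dmax, n = 300, 64, 500
grid = [2 ** j for j in range(1, 7)]                     # geometric grid 2, 4, ..., 64
x_null = rng.normal(size=(Dmax, d)) / np.sqrt(n)
x_alt = x_null.copy()
x_alt[:8, :40] += 0.06                                   # a dense-ish alternative
print("reject under null:", adaptive_scan_test(x_null, n, grid),
      " reject under alternative:", adaptive_scan_test(x_alt, n, grid))
```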
3.4 Special cases
To illustrate how a price for adaptation depends delicately on the choice of function space , we consider a variety of cases. Throughout, the assumption as well as the other assumptions necessary for our result are made.
3.4.1 Sobolev
Taking as emblematic of Sobolev space with smoothness , it can be shown that (recall (15) and (20)). Our upper and lower bounds assert an adaptive testing rate is given by
Ingster and Lepski [25] consider adaptation in the sparse regime and the dense regime for separately. In the sparse regime, they construct an adaptive test which is able to achieve the minimax rate; no cost is paid. In the dense regime, not only do they give a test which achieves , they also supply a lower bound (now, in our lower bound notation, can be taken) showing it cannot be improved. As seen in the above display, our test enjoys optimality in these regimes.
The reader should take care when comparing [25] to our result. In our setting, the unknown sparsity level can vary throughout the entire range . In contrast, [25] consider two separate cases. Either the unknown sparsity is constrained to , or it is constrained to . To elaborate, the precise value of is unknown but [25] assumes the statistician has knowledge of which of the two cases is in force. In contrast, we make no such assumption; nothing at all is known about the sparsity level. In our view, this situation is more faithful to the problem facing the practitioner. Despite being constructed for a seemingly more difficult setting, our test turns out to be optimal in Ingster and Lepski’s setting.
Our result provides finer detail around missed by [25]. In Theorem 3 of their article, Ingster and Lepski propose an adaptive test procedure which is applicable in the regime . It requires signal strength of squared order at least
This is suboptimal since our test achieves
which is faster.
3.4.2 Finite dimension
Consider a finite dimensional structure where is a positive integer. If , it can be shown , and so the minimax separation rate can be achieved by our adaptive test. In other words, no cost is paid for adaptation.
3.4.3 Exponential decay
Consider exponential decay of the eigenvalues where . It can be shown that , and so an adaptive rate is
The cost for adaptation here grows very slowly, and is perhaps another indication of the relatively “small” size of RKHSs based on Gaussian kernels (as noted in Section 2.5).
4 Adaptation to both sparsity and smoothness
So far we have assumed that the space is known. However, it is likely that is one constituent out of a collection of spaces indexed by some hyperparameter, such as a smoothness level. Typically, the true value of this hyperparameter is unknown in addition to the sparsity being unknown, and it is of interest to understand how the testing rate changes. To avoid excessive abstractness, we follow [25, 17] and adopt a working model where the eigenvalues have the form emblematic of Sobolev space with smoothness . To emphasize the dependence on , let us write and its corresponding space . Reiterating, we are interested in adapting to both the sparsity level and the smoothness level .
Ingster and Lepski [25] study adaptation to both sparsity and smoothness. In particular, they study adaptation for an unknown in a closed interval where the endpoints are known. As argued in [25], since is known, the sequence space basis in Section 1.2.2 for any can be taken to be -regular, and thus is known to the statistician. Furthermore, they separately study the dense regime and the sparse regime for some constant . We adopt the same setup as them.
Ingster and Lepski [25] claim the following adaptive rate. In the dense case , they make the assumption and state (Theorem 4 in [25])
In contrast (the same error affects the proofs of the lower bounds in Theorems 4 and 5 in [25]: the prior they define is not supported on the parameter space, and under their prior (see (75) in [25]) the various coordinate functions end up having different smoothness levels instead of sharing a single one), we will show the adaptive rate in the dense case is
The careful reader will note that in the special case , our answer recovers the adaptive separation rate proved by Spokoiny [41]. In the sparse case , their claimed rate (Theorem 5 in [25]) is
(23) |
As a partial correction, we prove the following lower bound
(24) |
In the regime , (23) and (24) match each other and match the minimax rate (see Section 2.5.1). However, in the case , there may be a difference between (23) and (24).
4.1 Dense
Fix and . Recall in the dense regime that . Define
(25) |
As mentioned earlier, we will be establishing this rate in the dense regime. For use in a testing procedure, define the geometric grid
(26) |
Note the statistician does not need to know nor to construct . Further note .
4.1.1 Upper bound
The test procedure employed follows the typical strategy in adaptive testing, namely constructing individual tests for potential values of in the grid , and then taking the maximum over the individual tests to perform signal detection. The cost of adaptation involves the logarithm of the cardinality of (i.e. ). Since the dense regime is in force, the individual tests are $\chi^2$ tests.
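The sketch below builds an illustrative grid over the smoothness interval so that the induced truncation levels form (roughly) a geometric progression, which is the mechanism by which a manageable number of individual tests suffices. The spacing rule and the classical testing truncation exponent $n^{2/(4\alpha+1)}$ are assumptions used only for this illustration, not the paper's exact construction.

```python
import numpy as np

def smoothness_grid(alpha_min, alpha_max, n):
    """Illustrative grid over [alpha_min, alpha_max] whose induced truncation
    levels n^{2/(4a+1)} differ by (roughly) a constant factor between
    consecutive grid points."""
    alphas, a = [], alpha_min
    while a <= alpha_max:
        alphas.append(a)
        a += (4 * a + 1) ** 2 / (8 * np.log(n))   # spacing making the log-truncation drop by ~1 per step
    return alphas

n = 10_000
alphas = smoothness_grid(0.5, 3.0, n)
trunc = [int(np.ceil(n ** (2 / (4 * a + 1)))) for a in alphas]
print("grid size:", len(alphas))
print("induced truncation levels:", trunc)
print("adaptation cost ~ log(grid size):", round(np.log(len(alphas)), 2))
```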
4.1.2 Lower bound
The following theorem states the lower bound. Note that it satisfies Definition 6 with the straightforward modification to incorporate adaptation over . In particular, the potential difficulty with absurdities outlined in Section 3.2 does not arise since the candidate rate
is never of the same order as the minimax rate .
Theorem 5.
Fix and . If , then there exists depending only on such that for all , we have
where is given by (25).
4.2 Sparse
Fix and . Recall that in the sparse regime we have . Define
(27) |
Ingster and Lepski [25] give a testing procedure achieving (23) and thus establish the upper bound. As noted earlier, when , not only do (27) and (23) match each other but they also match the minimax rate (see Section 2.5.1). In the regime , it is not clear from a lower bound perspective whether a cost for adaptation is unavoidable. We complement their upper bound by providing a lower bound which demonstrates that it is indeed necessary to pay a cost for adaptation. However, the cost we identify may not be sharp. To elaborate, in the case where is a large universal constant but much smaller than , the sparse regime rate (27) involves . From Spokoiny’s article [41], is instead expected. We leave it for future work to pin down the sharp cost.
Theorem 6.
Fix and . If , then there exists depending only on such that for all , we have
where is given by (27).
5 Discussion
5.1 Relaxing the centered assumption
In Section 1.2.2, in the definition of the parameter space (4) the signal was constrained to be centered. It was claimed this constraint is mild and can be relaxed; this section elaborates on this point.
Suppose is an additive function (possibly uncentered) with each . Let and note we can write where is the constant function equal to one. Since is an additive function, for any we have where with . In other words, we have where itself is an additive function with each .
Asserting is a sparse additive function implies is a (centered) sparse additive function. Following [38, 17], we impose constraints (i.e. RKHS norm constraints) on the centered component functions . Therefore, the following uncentered signal detection problem can be considered,
(28)
(29) |
where is given by (4). Let denote the minimax separation rate. Consider by orthogonality. Thus, there are two related detection problems, Problem I
and Problem II
Let and denote the minimax separation rates of the respective problems. We claim . From it is directly seen that , and so the lower bound is proved. To show the upper bound, consider that if , then by triangle inequality it follows that either or . Thus, there is enough signal to detect and the test which takes maximum over the subproblem minimax tests is optimal. Hence, the upper bound is now also proved.
All that remains is to determine the minimax rates and . Let us index starting from zero and take and to be the associated eigenfunctions and eigenvalues of the RKHS . Recall forms an orthonormal basis for under the usual inner product. Since by definition is orthogonal to the span of the constant function in , without loss of generality we can take to be the constant function equal to one, , and to be the remaining eigenfunctions orthogonal (in ) to constant functions.
The data (defined in (5)) is sufficient for Problem II. Therefore, the minimax rate is exactly given by our result established in this paper. The minimax rate can be upper bounded in the following manner. Consider that since . Therefore, the parametric rate can be achieved. Hence, . In other words, centering is immaterial to the fundamental limits of the signal detection problem.
5.2 Sharp constant
As we noted earlier, our results do not say anything about the sharp constant in the detection boundary. The problem of obtaining a sharp characterization of the constants in the detection boundary is interesting and likely delicate. In an asymptotic setup and in the Sobolev case, Gayraud and Ingster [17] were able to derive the sharp constant in the sparse regime for fixed under the condition . Gayraud and Ingster discuss that this condition is actually essential, in that the detection boundary no longer exhibits a dependence on when the condition is violated. This condition has a nice formulation in our notation, namely it is equivalent to the condition
This correspondence in the Sobolev case suggests this condition may actually be essential for a generic . It would be interesting to understand whether that is true, and to derive the sharp constant when it does hold.
To be clear, it may be the case (perhaps likely) that our proposed procedure is suboptimal in terms of the constant. Indeed, the existing literature on sparse signal detection, both in the sparse sequence model [14, 18] and in sparse linear regression [26, 1], relies on Higher Criticism type tests to achieve the optimal constant. Gayraud and Ingster [17] themselves use Higher Criticism. For a generic space , our procedure should not be understood as the only test which is rate-optimal. In the sparsity regime , we suspect an analogous Higher Criticism type statistic which accounts for the eigenstructure of the kernel might not only achieve the optimal rate, but also the sharp constant.
5.3 Future directions
There are a number of avenues for future work. First, we only considered a single space in which all component functions live. In some scenarios, it may be desirable to consider a different space for each component. Raskutti et al. [38] obtained the minimax estimation rate when considering multiple kernels (under some conditions on the kernels). We imagine our broad approach in this work could be extended to determine the minimax separation rate allowing multiple kernels. Instead of a common , it is likely different quantities will be needed per coordinate and the test statistics could be modified in the natural way. The theory developed here could be used directly, and it seems plausible the minimax separation rate could be established in a straightforward manner.
Another avenue of research involves considering “soft” sparsity in the form of constraints for . Yuan and Zhou [50] developed minimax estimation rates for RKHSs exhibiting polynomial decay of their eigenvalues (e.g. Sobolev space). In terms of signal detection, it is plausible that the quadratic functional estimator under the Gaussian sequence model with bound on the mean could be extended and used as a test statistic [10]. The hard sparsity and soft sparsity settings studied in [10] are handled quite similarly. It is possible not much additional theory needs to be developed in order to obtain minimax separation rates under soft sparsity.
Since hypothesis testing is quite closely related to functional estimation in many problems, it is natural to ask about functional estimation in the context of sparse additive models. For example, it would be of interest to estimate the quadratic functional or the norm . It is well known in the nonparametric literature that estimating norms for odd yields drastically different rates from testing and from even [32, 19]. A compelling direction is to investigate the same problem in the sparse additive model setting.
Additionally, it would be interesting to consider the nonparametric regression model where the design distribution exhibits some dependence between the coordinates. The correspondence between the white noise model (1) and the nonparametric regression model we relied on requires that the design distribution be close to uniform on . However, in practical situations it is typically the case that the coordinates of exhibit dependence, and it would be interesting to understand how the fundamental limits of testing are affected.
Finally, in a somewhat different direction than that discussed so far, it is of interest to study the signal detection problem under group sparsity in other models. As we had encouraged, our results can be interpreted exclusively in terms of sequence space, that is the problem (7)-(8) with parameter space (6). From this perspective, the group sparse structure is immediately apparent. Estimation has been extensively studied in group sparse settings, especially in linear regression (see [47] and references therein). Hypothesis testing has not witnessed nearly the same level of research activity, and so there is ample opportunity. We imagine some features of the rates established in this paper are general features of the group sparse structure, and it would be intriguing to discover the commonalities.
6 Acknowledgments
SK is grateful to Oleg Lepski for a kind correspondence. The research of SK is supported in part by NSF Grants DMS-1547396 and ECCS-2216912. The research of CG is supported in part by NSF Grant ECCS-2216912, NSF Career Award DMS-1847590, and an Alfred Sloan fellowship.
7 Proofs
7.1 Minimax upper bounds
7.1.1 Sparse
Proof of Proposition 2.
Fix . We will set at the end of the proof, so for now let . The universal constant will be selected in the course of the proof. Let denote the universal constant from Lemma 21. Set and where and are the universal constants in the exponential terms of Lemmas 15 and 14 respectively.
We first bound the Type I error. Since , we have . Therefore, we can apply Lemma 15 to obtain that
for any . Here, is a universal constant. Taking and noting that provided we select , we see that
where we have used that and where we have selected . Note we have also used that . Thus, with these choices of and , we have
provided we select .
We now examine the Type II error. To bound the Type II error, we will use Chebyshev’s inequality. In particular, consider that for any with , we have
(30)
provided that . To ensure this application of Chebyshev’s inequality is valid, we must compute suitable bounds for the expectation and variance of , which the following lemmas provide.
Lemma 4.
If is larger than some sufficiently large universal constant, then .
Lemma 5.
If is larger than some sufficiently large universal constant, then where is a universal constant.
These bounds are proved later on. Let us now describe why they allow us to bound the Type II error. From (30) as well as Lemmas 4 and 5, we have
provided we pick larger than a sufficiently large universal constant as required by Lemmas 4 and 5. With this bound in hand and since , we can now pick sufficiently large depending only on to obtain
To summarize, we can pick depending only on to be sufficiently large and also satisfying the condition and those of Lemmas 4, 5 to ensure the testing risk is bounded by , i.e.
as desired. ∎
It remains to prove Lemmas 4 and 5. Recall we work in the environment of the proof of Proposition 2.
Proof of Lemma 4.
To prove the lower bound on the expectation, first recall
Under , we have where . Here, the collection denotes the basis coefficients of . Intuitively, there are two reasons why the expectation might be small. First, we are thresholding and so we are intuitively removing those coordinates with small, but nonetheless nonzero, means. Furthermore, we are truncating at level and not considering higher-order basis coefficients; this also incurs a loss in signal.
Let us first focus on the effect from thresholding. Let denote the universal constant from Lemma 13. Applying Lemma 13 yields
where denotes the subset of active variables such that . Note that such exists and since . We have thus bounded the amount of signal lost from thresholding.
Let us now examine the effect of truncation. Consider that
Here, we have used for all . Therefore, we have shown , and thus have quantified the loss due to truncation.
We are now in position to put together the two pieces. Consider . Consequently,
We have used the decreasing order of the kernel’s eigenvalues, i.e. that implies . Further, consider that by definition of and , we have
With this in hand, it follows that . To summarize, we have shown
(31) |
Here, we have used . We can also conclude
In view of the above bound and since , it suffices to pick large enough to satisfy to ensure . The proof of the lemma is complete. ∎
Proof of Lemma 5.
To bound the variance of , recall that , and so . Therefore, we can apply Lemma 14. By Lemma 14, we have
where is a positive universal constant whose value can change from instance to instance. Recall is defined at the beginning of the proof of Proposition 2. Since and , we have
Therefore,
Taking larger than a sufficiently large universal constant, noting , and invoking (31), we have the desired result. ∎
Proof of Proposition 3.
Our proof is largely the same as in the proof of Proposition 2, except we invoke results about the “tail” rather than the “bulk”. Fix . We will make a choice of at the end of the proof, so for now let . We select in the course of the proof, but we will select now. Set with where and are the universal constants in the exponential terms of Lemmas 11 and 10 respectively.
We first bound the Type I error. Since , we have . Since , we have by Lemma 11 that for any ,
where is a universal positive constant. Taking and noting provided we have chosen , we see that
where we have used that and we have set . Thus, with these choices of and we have
provided we select .
We now examine the Type II error. To bound the Type II error, we will use Chebyshev’s inequality. In particular, consider that for any with , we have
(32)
provided that . To ensure this application of Chebyshev’s inequality is valid and to bound the Type II error, we will need a lower bound on the expectation of . We will also need an upper bound on the variance of in order to bound the Type II error. The following lemmas provide us with the requisite bounds; they are analogous to Lemmas 4 and 5 but are now in the context of the tail regime.
Lemma 6.
If is larger than some sufficiently large universal constant, then .
Lemma 7.
If is larger than some sufficiently large universal constant, then where is a universal constant.
Proof of Lemma 6.
The proof is similar in style to the proof of Lemma 4, except now results for the tail regime are invoked. Letting denote the universal constant from Lemma 9, applying Lemma 9, and arguing similarly to the proof of Lemma 4, we obtain
where denotes the subset of active variables such that . Note that such exists and since . Further arguing as in the proof of Lemma 4 and using , we have
(33)
To summarize, we have shown
(34)
Note that we also can conclude
Since , it suffices to pick large enough to satisfy to ensure . The proof of the lemma is complete. ∎
Proof of Lemma 7.
Recall that . By definition of we have . In other words, and so we can apply Lemma 10, which yields
where is a positive universal constant. Recall we had defined at the beginning of the proof of Proposition 3. Since , we have . Therefore,
where we have used that by definition of . Therefore,
Taking larger than a sufficiently large universal constant and invoking (7.1.1) yields the desired result. ∎
7.1.2 Dense
Proof of Proposition 4.
For ease of notation, set . Define . Let . We first bound the Type I error. Consider that under we have . Observe that and . Consequently, we have by Chebyshev’s inequality
We now examine the Type II error. Under , we have where denote the basis coefficients associated to . Note and . In order to proceed with the argument, we need to obtain a lower bound estimate for the signal strength. Letting denote the set of for which are nonzero, consider that for we have
We have used for all in the final line. Therefore by definition of we have
where we have used that .
We now continue with bounding the Type II error. By Chebyshev’s inequality, we have
Therefore, the sum of Type I and Type II errors is bounded by as desired. ∎
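In spirit, the dense test is a global chi-square statistic calibrated by Chebyshev's inequality. The sketch below illustrates that calibration in a sequence-model caricature; the dimensions and the level are arbitrary, and the statistic is not claimed to be the exact one analyzed in Proposition 4.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 500, 32                       # hypothetical number of covariates and basis size
y = rng.standard_normal((p, d))      # data under the null: pure noise, unit variance

T = np.sum(y ** 2) - p * d           # global (dense) statistic: centered sum of chi-squares

# Under the null, Var(T) = 2 p d, so Chebyshev's inequality calibrates a level-eta threshold
eta = 0.05
threshold = np.sqrt(2 * p * d / eta)
print("reject:", bool(T > threshold))   # False with probability at least 1 - eta under the null
```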
7.2 Minimax lower bounds
Proof of Proposition 1.
We break up the analysis into two cases.
Case 1: Suppose . We will construct a prior distribution on and use Le Cam’s two point method to furnish a lower bound. Define . Let . Let be the prior in which a draw is constructed by uniformly drawing of size and setting
where is given by . Note that implies and . By the Neyman-Pearson lemma and the inequality , we have
(35)
where denotes the mixture induced by . By the Ingster-Suslina method (Proposition 5) and Corollary 3, we have
where and are the corresponding random sets. Now, using that , we have
We have also used that and the inequality for and . Plugging into (35) yields the desired result.
Case 2: Suppose . We can repeat the analysis done in Case 1 with the modification of replacing every instance of with . ∎
Proof of Theorem 1.
We will construct a prior distribution on and use Le Cam’s two point method to furnish a lower bound. Set . Let . We break up the analysis into two cases.
Case 1: Suppose we are in the regime where . Set . Let be the prior in which a draw is obtained by uniformly drawing of size and drawing
where denotes the probability measure placing full mass at zero. Note that implies . Here, we have used that since . Furthermore, consider that for , we have
Here, we have used the ordering of the eigenvalues and by Lemma 25. Of course, for we have . Hence we have . We then have by the Neyman-Pearson lemma and the inequality that
(36)
For the following calculations, let and be the corresponding random sets. Also, let denote an iid collection of random variables. By the Ingster-Suslina method (Proposition 5), independence of with , and Corollary 3, we have
Consider by Lemma 1 that . Therefore,
We have used and the inequality for and . Using with (36) yields the desired result.
7.3 Adaptive lower bound
Proof of Theorem 2.
Recall the definitions of in (15) and in (20). By assumption, we have for a universal constant . For ease of notation, let us denote where is given by (19). Set and let . We now define a prior . A draw is obtained as follows. First, draw . Set . Then draw uniformly at random a subset of size exactly . Let satisfy . Draw independently
Here, . This description concludes the definition of .
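Schematically, a draw from a prior of this form can be simulated as follows; the grid of sparsity levels, the signal magnitude, and the placement of the signed coefficients are hypothetical placeholders for the quantities defined above.

```python
import numpy as np

rng = np.random.default_rng(2)
p, d = 200, 64                                   # hypothetical dimensions
sparsity_grid = 2 ** np.arange(7)                # hypothetical dyadic grid of sparsity levels
rho = 0.5                                        # hypothetical per-coefficient magnitude

s = int(rng.choice(sparsity_grid))               # random sparsity level from the grid
S = rng.choice(p, size=s, replace=False)         # support drawn uniformly among size-s subsets

theta = np.zeros((p, d))
theta[S, 0] = rho * rng.choice([-1.0, 1.0], size=s)   # independent random signs on active coordinates
print("sparsity:", s, "squared signal norm:", float(np.sum(theta ** 2)))
```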
Now we must show that is indeed supported on . Note that when we draw , the associated sparsity level always satisfies . Furthermore, conditional on we have
Furthermore, consider that for we have
where the last line follows from . Note we have used the ordering of the eigenvalues as well as by Lemma 26. Of course, for we have . Hence, and so is properly supported.
Writing for the mixture, we have
(37)
For the following calculations, let . Let be the corresponding quantities. Let and denote the corresponding sparsities. Denote the corresponding support sets and . Also, let denote an iid collection of random variables which is independent of and . For ease of notation, let . Likewise, let . By the Ingster-Suslina method (Proposition 5), we have
Here, we have used the inequality for . Consider that by Lemma 26
We have used along with the inequality for . Therefore,
(38)
Since , we can use Lemma 28 and the inequality for and to obtain
Recall we write and . Moreover, recall . Now observe
Recall we have . Define the sets
Then
Let us examine the second term. Note that for any we have for all . Using this, we obtain
The final line results from . Note we have also used to obtain the penultimate line. We now examine . Consider that . Therefore,
Note we have used . We have also used along with the inequality for all to obtain the penultimate line. Therefore, we have shown
Plugging into (37) yields
The proof is complete. ∎
7.4 Adaptive upper bound
Proof of Theorem 3.
Fix . In various places of the proof, we will point out that can be taken sufficiently large to obtain desired bounds, so for now let . Let denote the universal constant from Lemma 21. Let denote the maximum of the corresponding universal constants from Lemmas 15 and 11. Set
Here, with and being the universal constants in the exponential terms of Lemmas 15 and 14 respectively. Likewise, where and are the universal constants in the exponential terms of Lemmas 11 and 10. Note that these choices of are almost identical to the choices in the proofs of Propositions 2 and 3. The only modifications are the terms and , and the utility of this modification will become clear through the course of the proof.
For any , define
We examine the Type I and Type II errors separately. Focusing on the Type I error, union bound yields
(39)
We bound each term separately.
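The adaptive procedure scans a grid of candidate sparsity levels and rejects if any of the fixed-sparsity tests rejects; the union bound above then charges each grid point its own level. A schematic version, with hypothetical statistics and thresholds standing in for those defined in the proof:

```python
import numpy as np

def adaptive_test(stats: dict, thresholds: dict) -> bool:
    """Reject if any test in the grid rejects (a Bonferroni-style scan over sparsity levels)."""
    return any(stats[s] > thresholds[s] for s in stats)

# Toy usage: by the union bound, the overall Type I error is at most the sum of the
# per-grid-point levels, which motivates inflating each threshold by a log(grid size) term.
grid = [1, 2, 4, 8, 16]
stats = {s: float(np.random.default_rng(s).standard_normal()) for s in grid}
thresholds = {s: 3.0 + np.sqrt(2 * np.log(len(grid))) for s in grid}
print("reject:", adaptive_test(stats, thresholds))
```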
Type I error: Bulk
For ,
Consider that Lemma 15 gives us the following. If we select satisfying
(40)
then we have . Let us select
To verify (40), consider that by our choice of , and so
Here, we have used that . We have also used , which holds provided we select large enough (i.e. ).
Likewise, consider
Therefore, (40) is satisfied, and so we have
where is the universal constant . With this bound in hand, observe
Here we have used from Lemma 3. We bound each term separately. First, consider
provided we take sufficiently large.
Likewise, consider
again, provided we take sufficiently large. We have used that exhibits at most logarithmic growth in , i.e. we use the crude bound . Therefore, we have established
(41)
Type I error: Tail
For , we have
Consider that Lemma 11 gives us an analogue to (40), namely if we select satisfying
(42)
then we have . Let us select
To verify (42), consider that by our choice of , and so
Here, we have used . Likewise, consider that
Therefore, (42) is satisfied, and so we have
where is the universal constant . With this bound in hand, observe
Here we have used from Lemma 3. From here, the same argument from the bulk case can be employed to conclude
(43)
provided is taken sufficiently large.
Type I error: Dense
Let us now bound . Consider that for any , we have
To obtain the fourth line, we have used which implies . In the above display, we have also used (which holds provided we take large enough) to obtain the penultimate line and Lemma 16 to obtain the final line. Consequently,
(44)
since by Lemma 3 and , which holds provided we take sufficiently large.
Type II error: We now examine the Type II error. Suppose . Let with . We proceed by considering various cases.
Type II error: Dense
Suppose . Let denote the smallest element in which is greater than or equal to . Then
Consider that
where the collection denotes the basis coefficients for . We will also use the matrix to denote this collection of basis coefficients; note that . Let denote the set of for which are nonzero. Observe
Therefore,
We have used that to obtain the third line. Therefore, we have
The signal magnitude satisfies the requisite strength needed to successfully detect, which can be seen by following the argument of Proposition 4. Hence, we can achieve Type II error less than or equal to by selecting sufficiently large. We can combine this bound with (7.4) to conclude that the total testing risk is bounded above by , as desired.
Type II error: Bulk
Suppose and . Let be the smallest element in greater than or equal to . Let be the smallest element in greater than or equal to . By the definitions of these grids, we have and . Then
With these choices, we have
Note that since , we have . Following arguments similar to those in the proof of Proposition 2, it can be seen that the necessary signal strength to successfully detect is of squared order
This is precisely the order of , and so we can obtain Type II error less than by choosing sufficiently large. We can combine this bound with (7.4) to conclude that the total testing risk is bounded above by , as desired.
Type II error: Tail
Suppose and . Let and be as defined in the previous case, i.e. Type II error analysis for the bulk case. As before, we have
Now, consider that and so .
Suppose we have
We can follow arguments similar to those in the proof of Proposition 3 to see that the necessary signal strength to successfully detect is of squared order
This is precisely the order of and so we can obtain Type II error less than by choosing sufficiently large.
On the other hand, suppose we have
As argued in the previous section about the bulk regime, the necessary signal strength to successfully detect is of squared order
Consider that
and so we can obtain Type II error less than by choosing sufficiently large. We can combine this bound with (7.4) to conclude that the total testing risk is bounded above by , as desired. ∎
7.5 Smoothness and sparsity adaptive lower bounds
Proof of Theorem 5.
Fix . Fix any . Through the course of the proof, we will note where we must take suitably small, so for now let . For each , define the geometric grid
Note .
We now define a prior which is supported on the alternative hypothesis. Let . Let denote the solution to
Note that for any . Note is random since is random. Draw uniformly at random a subset of size . Draw independently
Here, is given by . Having defined , the definition of the prior is complete.
Now we must show is indeed supported on the alternative hypothesis. Consider that for , we have
Furthermore, consider that for we have by definition of ,
We can take to ensure . Of course, for we have . Hence, we have shown with probability one, and so has the proper support.
Writing to denote the mixture induced by the prior , we have
For the following calculations, let . Let be the corresponding quantities and let denote the corresponding smoothness levels. Further let denote the corresponding support sets. Note both are of size . Let denote an iid collection of random variables which is independent of all the other random variables. By the Ingster-Suslina method (Proposition 5), we have
From the definition of , we have
Here, we have used since and . We have also used the inequality for . With this in hand, it follows that
Noting that we can write , and that , we can follow the same steps as in the proof of Proposition 4.2 in [16]. Taking suitably small, we obtain which yields
as desired. ∎
Proof of Theorem 6.
Some simplification is convenient. Note can be rewritten as
Note that in the case , we have . Therefore, and so we can further simplify
(46)
Writing in the form (46) is convenient. The lower bound is exactly the minimax lower bound and so no new argument is needed. All that needs to be proved is the lower bound .
The proof is very similar to that of Theorem 5, so we only point out the modifications in the interest of brevity. Fix any . Let be the prior from the proof of Theorem 5, but use given by (47) and given by (48), defined below, instead. Define the geometric grid
(47)
Let denote the solution to
(48)
Note for any . Also, use
It can be checked in the same manner that with these modifications is properly supported on the alternative hypothesis. We can continue along as in the proof of Theorem 5 up until the point we have
From here, we use the fact that implies . In other words, there exists a constant such that
Then by taking sufficiently small, the rest of the proof of Theorem 5 can be carried out to obtain the desired result. ∎
7.6 Smoothness and sparsity adaptive upper bounds
Proof of Theorem 4.
Fix . For ease, let us just write . We will note throughout the proof where we take suitably large, so for now let . We will also note where we take suitably large. We will also note where we take suitably large. We first examine the Type I error. By union bound and taking ,
for some universal positive constant . Here, we have used . Taking suitably large, the Type I error is bounded as desired
We now examine the Type II error. Fix any , , and with . Set
Let be the smallest element larger than . Note by definition of .
Consider that since and , it immediately follows that grows polynomially in , and so . Therefore, we have for some universal positive constant . Then by Chebyshev’s inequality
Consider that
Therefore, taking sufficiently large, we have
Hence, the Type II error is bounded suitably. Since was arbitrary, we have shown that
as desired.
∎
References
- [1] Ery Arias-Castro, Emmanuel J. Candès and Yaniv Plan “Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism” In Ann. Statist. 39.5, 2011, pp. 2533–2556 DOI: 10.1214/11-AOS910
- [2] Yannick Baraud “Non-asymptotic minimax rates of testing in signal detection” In Bernoulli 8.5, 2002, pp. 577–606
- [3] Sohom Bhattacharya, Jianqing Fan and Debarghya Mukherjee “Deep neural networks for nonparametric interaction models with diverging dimension” In Ann. Statist., Forthcoming
- [4] Lucien Birgé and Pascal Massart “Gaussian model selection” In J. Eur. Math. Soc. (JEMS) 3.3, 2001, pp. 203–268 DOI: 10.1007/s100970100031
- [5] Lawrence D. Brown and Mark G. Low “Asymptotic equivalence of nonparametric regression and white noise” In Ann. Statist. 24.6, 1996, pp. 2384–2398 DOI: 10.1214/aos/1032181159
- [6] M. V. Burnasev “Minimax detection of an imperfectly known signal against a background of Gaussian white noise” In Teor. Veroyatnost. i Primenen. 24.1, 1979, pp. 106–118
- [7] Alexandra Carpentier and Nicolas Verzelen “Adaptive estimation of the sparsity in the Gaussian vector model” In Ann. Statist. 47.1, 2019, pp. 93–126 DOI: 10.1214/17-AOS1680
- [8] Julien Chhor, Rajarshi Mukherjee and Subhabrata Sen “Sparse signal detection in heteroscedastic Gaussian sequence models: sharp minimax rates” In Bernoulli 30.3, 2024, pp. 2127–2153 DOI: 10.3150/23-bej1667
- [9] O. Collier, L. Comminges and A. B. Tsybakov “On estimation of nonsmooth functionals of sparse normal means” In Bernoulli 26.3, 2020, pp. 1989–2020 DOI: 10.3150/19-BEJ1180
- [10] Olivier Collier, Laëtitia Comminges and Alexandre B. Tsybakov “Minimax estimation of linear and quadratic functionals on sparsity classes” In Ann. Statist. 45.3, 2017, pp. 923–958 DOI: 10.1214/15-AOS1432
- [11] Olivier Collier, Laëtitia Comminges, Alexandre B. Tsybakov and Nicolas Verzelen “Optimal adaptive estimation of linear functionals under sparsity” In Ann. Statist. 46.6A, 2018, pp. 3130–3150 DOI: 10.1214/17-AOS1653
- [12] L. Comminges, O. Collier, M. Ndaoud and A. B. Tsybakov “Adaptive robust estimation in sparse vector model” In Ann. Statist. 49.3, 2021, pp. 1347–1377 DOI: 10.1214/20-aos2002
- [13] Nabarun Deb, Rajarshi Mukherjee, Sumit Mukherjee and Ming Yuan “Detecting structured signals in Ising models” In Ann. Appl. Probab. 34.1A, 2024, pp. 1–45 DOI: 10.1214/23-aap1929
- [14] David Donoho and Jiashun Jin “Higher criticism for detecting sparse heterogeneous mixtures” In Ann. Statist. 32.3, 2004, pp. 962–994 DOI: 10.1214/009053604000000265
- [15] M. S. Ermakov “Minimax detection of a signal in Gaussian white noise” In Teor. Veroyatnost. i Primenen. 35.4, 1990, pp. 704–715 DOI: 10.1137/1135098
- [16] Chao Gao, Fang Han and Cun-Hui Zhang “On estimation of isotonic piecewise constant signals” In Ann. Statist. 48.2, 2020, pp. 629–654 DOI: 10.1214/18-AOS1792
- [17] Ghislaine Gayraud and Yuri Ingster “Detection of sparse additive functions” In Electron. J. Stat. 6, 2012, pp. 1409–1448 DOI: 10.1214/12-EJS715
- [18] Peter Hall and Jiashun Jin “Innovated higher criticism for detecting sparse signals in correlated noise” In Ann. Statist. 38.3, 2010, pp. 1686–1732 DOI: 10.1214/09-AOS764
- [19] Yanjun Han, Jiantao Jiao and Rajarshi Mukherjee “On estimation of -norms in Gaussian white noise models” In Probab. Theory Related Fields 177.3-4, 2020, pp. 1243–1294 DOI: 10.1007/s00440-020-00982-x
- [20] Trevor Hastie and Robert Tibshirani “Generalized additive models” In Statist. Sci. 1.3, 1986, pp. 297–318
- [21] Yu. I. Ingster “Adaptive tests for minimax testing of nonparametric hypotheses” In Probability theory and mathematical statistics, Vol. I (Vilnius, 1989) “Mokslas”, Vilnius, 1990, pp. 539–549
- [22] Yu. I. Ingster “Asymptotically minimax testing of nonparametric hypotheses” In Probability theory and mathematical statistics, Vol. I (Vilnius, 1985) VNU Sci. Press, Utrecht, 1987, pp. 553–574
- [23] Yu. I. Ingster “Minimax nonparametric detection of signals in white Gaussian noise” In Problemy Peredachi Informatsii 18.2, 1982, pp. 61–73
- [24] Yu. I. Ingster and I. A. Suslina “Nonparametric Goodness-of-Fit Testing Under Gaussian Models” 169, Lecture Notes in Statistics Springer-Verlag, New York, 2003 DOI: 10.1007/978-0-387-21580-8
- [25] Yuri Ingster and Oleg Lepski “Multichannel nonparametric signal detection” In Math. Methods Statist. 12.3, 2003, pp. 247–275
- [26] Yuri I. Ingster, Alexandre B. Tsybakov and Nicolas Verzelen “Detection boundary in sparse regression” In Electron. J. Stat. 4, 2010, pp. 1476–1526 DOI: 10.1214/10-EJS589
- [27] Iain M. Johnstone “Gaussian estimation: Sequence and wavelet models”
- [28] Iain M. Johnstone “Chi-square oracle inequalities” In State of the art in probability and statistics (Leiden, 1999) 36, IMS Lecture Notes Monogr. Ser. Inst. Math. Statist., Beachwood, OH, 2001, pp. 399–418 DOI: 10.1214/lnms/1215090080
- [29] Vladimir Koltchinskii and Ming Yuan “Sparsity in multiple kernel learning” In Ann. Statist. 38.6, 2010, pp. 3660–3695 DOI: 10.1214/10-AOS825
- [30] Subhodh Kotekal and Chao Gao “Minimax rates for sparse signal detection under correlation” In Inf. Inference 12.4, 2023, Paper No. iaad044, 97 pp. DOI: 10.1093/imaiai/iaad044
- [31] B. Laurent and P. Massart “Adaptive estimation of a quadratic functional by model selection” In Ann. Statist. 28.5, 2000, pp. 1302–1338 DOI: 10.1214/aos/1015957395
- [32] O. Lepski, A. Nemirovski and V. Spokoiny “On estimation of the norm of a regression function” In Probab. Theory Related Fields 113.2, 1999, pp. 221–253 DOI: 10.1007/s004409970006
- [33] Yi Lin and Hao Helen Zhang “Component selection and smoothing in multivariate nonparametric regression” In Ann. Statist. 34.5, 2006, pp. 2272–2297 DOI: 10.1214/009053606000000722
- [34] Haoyang Liu, Chao Gao and Richard J. Samworth “Minimax rates in sparse, high-dimensional change point detection” In Ann. Statist. 49.2, 2021, pp. 1081–1112 DOI: 10.1214/20-aos1994
- [35] Lukas Meier, Sara van de Geer and Peter Bühlmann “High-dimensional additive modeling” In Ann. Statist. 37.6B, 2009, pp. 3779–3821 DOI: 10.1214/09-AOS692
- [36] Rajarshi Mukherjee and Gourab Ray “On testing for parameters in Ising models” In Ann. Inst. Henri Poincaré Probab. Stat. 58.1, 2022, pp. 164–187 DOI: 10.1214/21-aihp1157
- [37] Rajarshi Mukherjee and Subhabrata Sen “On minimax exponents of sparse testing” arXiv:2003.00570 [math, stat] arXiv, 2020 DOI: 10.48550/arXiv.2003.00570
- [38] Garvesh Raskutti, Martin J. Wainwright and Bin Yu “Minimax-optimal rates for sparse additive models over kernel classes via convex programming” In J. Mach. Learn. Res. 13, 2012, pp. 389–427
- [39] Pradeep Ravikumar, John Lafferty, Han Liu and Larry Wasserman “Sparse additive models” In J. R. Stat. Soc. Ser. B Stat. Methodol. 71.5, 2009, pp. 1009–1030 DOI: 10.1111/j.1467-9868.2009.00718.x
- [40] Markus Reiß “Asymptotic equivalence for nonparametric regression with multivariate and random design” In Ann. Statist. 36.4, 2008, pp. 1957–1982 DOI: 10.1214/07-AOS525
- [41] V.. Spokoiny “Adaptive hypothesis testing using wavelets” In Ann. Statist. 24.6, 1996, pp. 2477–2498 DOI: 10.1214/aos/1032181163
- [42] N. M. Temme “The asymptotic expansion of the incomplete gamma functions” In SIAM J. Math. Anal. 10.4, 1979, pp. 757–766 DOI: 10.1137/0510071
- [43] N. M. Temme “Uniform asymptotic expansions of the incomplete gamma functions and the incomplete beta function” In Math. Comp. 29.132, 1975, pp. 1109–1114 DOI: 10.2307/2005750
- [44] A. B. Tsybakov “Pointwise and sup-norm sharp adaptive estimation of functions on the Sobolev classes” In Ann. Statist. 26.6, 1998, pp. 2420–2469 DOI: 10.1214/aos/1024691478
- [45] Alexandre B. Tsybakov “Introduction to Nonparametric Estimation”, Springer Series in Statistics Springer, New York, 2009 DOI: 10.1007/b13794
- [46] Roman Vershynin “High-Dimensional Probability” 47, Cambridge Series in Statistical and Probabilistic Mathematics Cambridge University Press, Cambridge, 2018 DOI: 10.1017/9781108231596
- [47] Martin J. Wainwright “High-Dimensional Statistics” 48, Cambridge Series in Statistical and Probabilistic Mathematics Cambridge University Press, Cambridge, 2019 DOI: 10.1017/9781108627771
- [48] Yuting Wei and Martin J. Wainwright “The local geometry of testing in ellipses: tight control via localized Kolmogorov widths” In IEEE Trans. Inform. Theory 66.8, 2020, pp. 5110–5129 DOI: 10.1109/TIT.2020.2981313
- [49] Yun Yang and Surya T. Tokdar “Minimax-optimal nonparametric regression in high dimensions” In Ann. Statist. 43.2, 2015, pp. 652–674 DOI: 10.1214/14-AOS1289
- [50] Ming Yuan and Ding-Xuan Zhou “Minimax optimal rates of estimation in high dimensional additive models” In Ann. Statist. 44.6, 2016, pp. 2564–2593 DOI: 10.1214/15-AOS1422
- [51] Anru R. Zhang and Yuchen Zhou “On the non-asymptotic and sharp lower tail bounds of random variables” In Stat 9, 2020, Paper No. e314, 11 pp. DOI: 10.1002/sta4.314
Appendix A Hard thresholding
In this section, we collect results about the random variable for and defined in (13). We consider the tail and the bulk separately.
The proof outlines are similar to those employed in [34]. However, they only consider , whereas we need to consider the general case. Consequently, a much more careful analysis is required.
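As a sanity check on the style of result collected in this appendix, the mean of a centered, hard-thresholded (noncentral) chi-square variable can be estimated by Monte Carlo; the form of the variable below is a plausible stand-in for the quantity defined in (13), with hypothetical degrees of freedom, noncentrality, and threshold.

```python
import numpy as np

rng = np.random.default_rng(3)
d, ncp, n_mc = 20, 5.0, 100_000                 # degrees of freedom, noncentrality, Monte Carlo size

# Noncentral chi-square draws: squared norm of N(mu, I_d) with ||mu||^2 = ncp
mu = np.zeros(d)
mu[0] = np.sqrt(ncp)
x = np.sum((rng.standard_normal((n_mc, d)) + mu) ** 2, axis=1)

tau = d + 2.5 * np.sqrt(2 * d)                  # a hypothetical hard threshold
w = (x - d) * (x > tau)                         # centered, hard-thresholded variable
print("estimated mean:", w.mean(), "+/-", 2 * w.std() / np.sqrt(n_mc))
```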
A.1 Tail
Lemma 8 (Moment generating function).
Suppose is a positive universal constant. There exist universal positive constants and such that if and , then we have
where , , and where is given by (13).
Proof.
We follow the broad approach of the argument presented in the proof of Lemma 18 in [34]. The universal positive constant will be chosen later on in the proof, so for now let and . Note we will also take large enough so that Lemma 21 is applicable. Since , we have
Consider that for any we have
Therefore,
(49)
Each term will be bounded separately. Considering the first term, note that Lemma 22 asserts for a universal constant . Note also that . Therefore,
where is a universal positive constant and remains a universal constant but whose value may change from line to line. Note we have also used Corollary 1 in the above display as well as . To summarize, we have shown
(50)
We now bound the second term in (49). Let denote the probability density function of the distribution. We have
An application of Lemma 18 yields
where remains a universal constant but has a value which may change from line to line. Note we have used , Corollary 1, and . Likewise, consider that
where we have used , , Lemma 22, and Corollary 1. Again, and remain universal constants but have values which may change from line to line. Thus, we have shown
(51)
We now bound the final term in (49). Consider that for we have
Let . Then and so
where . To summarize, we have shown
(52)
To continue, we would like to apply Corollary 2, but we must first verify the conditions. Let and with . Note we can take to be a sufficiently large universal constant in order to ensure is sufficiently large. Since , it follows from Lemma 22 that for some positive universal constant . Without loss of generality, we can take . Let us restrict our attention to . Observe that
where we have defined . Since we have shown , we can apply Corollary 2. By Corollary 2 we have
Here, denotes the cumulative distribution function of the standard normal distribution and is a universal constant. By the Gaussian tail bound for where , we have
where the value of has changed from line to line but remains a universal constant. To continue the calculation, we need to bound from below. Note we can take . For , it follows from and that
Consequently, we have the bound
where the value of has changed but remains a universal constant. We now examine the term . Consider that
Therefore, letting the value of change from line to line but remaining a universal constant, we have from (52)
for a universal positive constant . We have used that for . Note that we can use this since . Since for all , it follows that
(53)
where the value of , again, has changed but remains a universal constant. Putting together our bounds (50), (51), (53) into (49) yields
for where are universal constants. The proof is complete. ∎
Lemma 9.
Let . Suppose is a universal positive constant. There exists a universal constant such that for every , we have
Here, is given by (13).
Proof.
We will make a choice for at the end of the proof. Consider first the case where . Then by definition of and so we have the desired result. The second case follows since the expression inside the expectation is stochastically increasing in . Moving on to the final case, suppose . Since , we have
Consequently,
By Lemma 22 and , there exists a universal positive constant such that . Therefore,
Selecting yields the desired result. ∎
Lemma 10.
Let . If for a universal positive constant , then
Here, is given by (13) and is a universal positive constant.
Proof.
Lemma 11.
Suppose is a universal positive constant. There exist universal positive constants and such that if , , and , then
where and is given by (13). Here, is a universal positive constant.
Proof.
Let be the constant from Lemma 8. We use the Chernoff method to obtain the desired bound. Let be as in Lemma 8 and let be the universal constant from Lemma 8. For any , we have by Lemma 8
where is a universal constant and is a universal constant whose value may change from line to line. Selecting and then choosing suitably yields the desired result. The proof is complete. ∎
A.2 Bulk
Lemma 12.
Proof.
The argument we give will be similar to the proof of Lemma 8, namely we will separately bound each term in (49) and substitute into the equation
We start with the first term on the right-hand side in (49). By Lemma 18, we have for a universal positive constant . Note that since we trivially have . Consequently, arguing as in the proof of Lemma 8, we have
where are universal constants. We have used Corollary 1 to obtain the final inequality. We now bound the second term in (49). Letting denote the probability density function of the distribution, we can repeat and then continue the calculation in the proof of Lemma 8 to obtain
where the values of have changed but remain universal constants. We have used Lemma 24 here.
We now bound the final term in (49). Arguing exactly as in the proof of Lemma 8, we have
for . Recall that .
Let us restrict our attention to . We seek to apply Corollary 2, but we must verify the conditions. Let and with . Observe
Consider that . Therefore, and so
Note we have used . Since and which is a sufficiently large universal constant, we can apply Corollary 2. By Corollary 2, we have
Here, is a positive universal constant. Recall that denotes the cumulative distribution function of the standard normal distribution. By the Gaussian tail bound for , we have
(54)
where the value of has changed but remains a universal constant. To continue with the bound, we need to bound from below. Let us take larger than . Since , we have
where is a universal constant. We can conclude from (54) that
where the value of has changed but remains a universal constant. We now examine the term . Arguing exactly as in the proof of Lemma 8, we obtain
which, as argued in the proof of Lemma 8, leads us to the bound
for as desired. ∎
Lemma 13.
Proof.
Lemma 14.
Proof.
Let and respectively denote the probability density function and cumulative distribution function of the distribution.
Case 1: Consider the first case in which . Then
where we have applied the definition of and we have applied Lemma 24. Here, is a universal positive constant. An application of Corollary 1 gives the desired result for this case.
We now move to the other two cases. Suppose . For ease of notation, let . Observe
(55)
Note that since (by an appeal to stochastic ordering), we can find upper bounds for the square of this conditional expectation by first finding upper bounds on the conditional expectation. We examine each term separately in the above display. First, consider
(56)
We now examine the second term. Note we can write where . Therefore,
(57)
where we have used that . With this in hand, we have , which further implies . Note we have also used to obtain the second line in the above display. With the above display in hand, we now examine the remaining two cases.
Case 2: Consider the case . Observe that
Examining the denominator, since where the two -variates on the right hand side are independent, we have
where is a universal positive constant. Examining the numerator, consider that by Lemma 18 we have
Hence, from (57), we have shown
Thus, we have the bound
Consider that
where the value of can change in each expression but remains a universal positive constant. The final inequality follows from Stirling’s formula. Moreover, since , there exists a positive universal constant such that . Furthermore, it follows by Chebyshev’s inequality that
Consequently,
The proof for this case is complete.
Lemma 15.
Appendix B Properties of the distribution
Theorem 7 (Bernstein’s inequality - Theorem 2.8.1 [46]).
Let be independent mean-zero subexponential random variables. Then, for every , we have
where is a universal constant and denote the subgaussian and subexponential norms respectively (see (2.13) and (2.21) of [46]).
Corollary 1.
Suppose . If , then
where is a universal constant.
Lemma 16 (Lemma 1 [31]).
Let . If and , then
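In the central chi-square special case (equal weights), the bound of [31] reads P(χ²_d ≥ d + 2√(dx) + 2x) ≤ e^{−x}; reducing to this case is our simplification here, since the general weighted display is not reproduced above. A quick numerical check:

```python
import numpy as np
from scipy.stats import chi2

# Central chi-square case of Lemma 1 of [31]: P(chi2_d >= d + 2*sqrt(d*x) + 2*x) <= exp(-x)
d = 10
for x in (1.0, 3.0, 5.0):
    tail = chi2.sf(d + 2 * np.sqrt(d * x) + 2 * x, df=d)
    print(f"x={x}: tail={tail:.3e}  bound={np.exp(-x):.3e}  holds={tail <= np.exp(-x)}")
```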
Lemma 17 (Corollary 3 [51]).
Suppose . There exist universal constants such that
for all .
Lemma 18 ([28]).
Let and respectively denote the probability density and cumulative distribution functions of the distribution. Then the following relations hold:
- (i) ,
- (ii) ,
- (iii) ,
- (iv) .
Lemma 19.
Let and respectively denote the probability density and cumulative distribution functions of the distribution. Suppose . If , then .
Proof.
By the Mean Value Theorem, we have for any ,
Consider that . So taking yields
We now evaluate the infimum on the left-hand side. For , consider that an application of Lemma 18 gives
Since , it follows that
Thus we can immediately conclude as desired. ∎
Lemma 20.
Let denote the cumulative distribution function of the distribution. If , then
where is the upper incomplete gamma function defined in Theorem 8.
Proof.
The result follows directly from a change of variables when integrating the probability density function. ∎
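In terms of the regularized upper incomplete gamma function Q(a, x) = Γ(a, x)/Γ(a), the change of variables gives P(χ²_d > x) = Q(d/2, x/2), which is easy to confirm numerically (the values of d and x below are arbitrary):

```python
from scipy.stats import chi2
from scipy.special import gammaincc   # regularized upper incomplete gamma Q(a, x)

d, x = 7, 11.3
print(chi2.sf(x, df=d))               # P(chi2_d > x)
print(gammaincc(d / 2, x / 2))        # Q(d/2, x/2): agrees with the line above
```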
Lemma 21.
Suppose is larger than a sufficiently large universal constant. There exist universal positive constants and such that the following holds. If and , then
where is given by (13).
Proof.
For ease of notation, let . Let and denote the probability density and cumulative distribution functions of the distribution. We will select the universal constant later on in the proof. By Lemma 18, we have
Rearranging terms and invoking Stirling’s approximation (which states as ) yields
for a universal constant since is larger than a sufficiently large universal constant. Applying Lemma 20 and Corollary 2 yields
where is a universal constant. Observe that is larger than a sufficiently large universal constant since is larger than a sufficiently large universal constant. Using the fact that as , we have
for a universal positive constant . Consider that
Consequently,
Therefore, we have the bound
Consider further that the inequality gives us
Hence, we have
Let us take . With this choice, we have and so . Therefore,
for a universal positive constant . Thus we have
as desired. ∎
Theorem 8 (Uniform expansion of the incomplete gamma function [43]).
For and real numbers, define the upper incomplete gamma function
Further define , and . Then admits an asymptotic series expansion in which is uniform in . In other words, for any integer , we have
where the remainder term satisfies
Here, denotes the cumulative distribution function of the standard normal distribution.
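The leading term of the expansion approximates Q(a, λa) by Φ(−η√a), where η is determined by λ as above. The comparison can be made directly with scipy; the values of a and λ below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import gammaincc
from scipy.stats import norm

a, lam = 50.0, 1.4                                         # arbitrary illustrative values
eta = np.sign(lam - 1) * np.sqrt(2 * (lam - 1 - np.log(lam)))

print("Q(a, lam * a) =", gammaincc(a, lam * a))            # regularized upper incomplete gamma
print("leading term  =", norm.cdf(-eta * np.sqrt(a)))      # Phi(-eta * sqrt(a)), close for large a
```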
Corollary 2.
Consider the setting of Theorem 8. For any and , we have
where
Consequently, if is larger than some universal positive constant and , we have
where is some universal positive constant.
Proof.
The first two displays follow exactly from Theorems 8 and 9. To show the final display, we must show that whenever . First, consider the Taylor expansion
where is some point between and . Therefore,
Thus
since is between and . Therefore, the final display in the statement of the Corollary follows by taking to be larger than some universal constant and taking to be a large enough universal constant. ∎
Lemma 22.
Let . If , then
where is a positive universal constant.
Proof.
We will choose a universal constant at the end of the proof, so for now we leave it as undetermined. Let denote the probability density function of the distribution. Observe that
By Corollary 1, there exists a universal constant such that
Here we have used . Further consider that
Therefore, by Lemma 17 we have
where and are universal positive constants. Taking completes the proof. ∎
Lemma 23.
Let . If , then
where is a positive universal constant.
Proof.
We will choose a universal constant at the end of the proof, so for now we leave it as undetermined. Let denote the probability density function of the distribution. Consider
By Corollary 1, there exists a universal positive constant such that
Here we have used . Further consider
Hence, by Lemma 17 we have
where is a universal constant. Taking completes the proof. ∎
Lemma 24.
Let . If , then
for some universal positive constant .
Proof.
We will use a universal constant in our proof; a choice for it will be made at the end. Let denote the probability density of the distribution. Observe that
By Corollary 1, there exists a universal positive constant such that
Here, we have used . With this term under control, now observe
Hence, by Lemma 17 we have
Taking , we have
which is precisely the desired result. ∎
Appendix C Miscellaneous
C.1 Minimax
Lemma 25.
We have
Consequently, .
Proof.
Consider that
Here, we have used the ordering of the eigenvalues and the fact that to obtain the second term. The bound follows immediately from and by definition of . ∎
C.2 Adaptation
Lemma 26.
Suppose . We have
Consequently, .
Proof.
Consider that
Here, we have used the ordering of the eigenvalues and the fact that to obtain the second term. The bound follows immediately from and by definition of . ∎
Proof of Lemma 2.
We first prove the second inequality. Since and , it immediately follows that . Therefore,
as desired.
Now we prove the first inequality. For any , we have by the ordering of the eigenvalues
For any , by definition of it follows that
Since for all we have shown
taking max over yields
as desired. ∎
Proof of Lemma 3.
For ease of notation, let us write . By the assumption and we have for all . By definition of , we have
and
Define
It is clear for each since the inequality in the penultimate display is strict. Moreover, observe that for all by definition of . Further, since is decreasing in its second argument, it follows that for all . Letting , it immediately follows that for all . By definition of , it follows that for any
where we have used and to obtain the final inequality. The proof is complete. ∎
C.3 Technical
Proposition 5 (Ingster-Suslina method [24]).
Suppose is a positive definite matrix and is a parameter space. If is a probability distribution supported on , then
where and denotes the -divergence defined in Section 1.1.
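For a Gaussian location family with identity covariance and unit noise level, the Ingster-Suslina identity specializes to 1 + χ²(P̄_π ∥ P_0) = E[exp(⟨θ, θ′⟩)], where θ, θ′ are independent draws from the prior. The Monte Carlo sketch below evaluates this second moment for a sparse Rademacher-type prior; the dimension, sparsity, and magnitude are hypothetical, and the unit noise scaling is an assumption on our part.

```python
import numpy as np

rng = np.random.default_rng(4)
p, s, rho, n_mc = 500, 5, 0.4, 20_000    # dimension, sparsity, magnitude, Monte Carlo size

def draw_theta() -> np.ndarray:
    """One draw from a sparse prior: uniform size-s support with independent signs."""
    theta = np.zeros(p)
    S = rng.choice(p, size=s, replace=False)
    theta[S] = rho * rng.choice([-1.0, 1.0], size=s)
    return theta

# 1 + chi^2(P_bar || P_0) = E[exp(<theta, theta'>)] for independent prior draws (unit noise)
vals = np.array([np.exp(draw_theta() @ draw_theta()) for _ in range(n_mc)])
print("estimated 1 + chi-square divergence:", vals.mean())
```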
Lemma 27 ([45]).
If are probability measures on a measurable space with , then .
Lemma 28.
If is distributed according to the hypergeometric distribution with probability mass function for , then for .
Corollary 3 ([10]).
If is distributed according to the hypergeometric distribution with probability mass function for , then and for .
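The hypergeometric distribution enters because the overlap |S ∩ S′| of two independent, uniformly drawn size-s subsets of {1, …, p} is hypergeometric; this is the mechanism behind the second-moment calculations in the lower bound proofs, and it is easy to confirm by simulation (p and s below are arbitrary).

```python
import numpy as np
from scipy.stats import hypergeom

rng = np.random.default_rng(5)
p, s, n_mc = 30, 6, 50_000

overlaps = np.array([
    len(set(rng.choice(p, size=s, replace=False)) & set(rng.choice(p, size=s, replace=False)))
    for _ in range(n_mc)
])

# Empirical overlap frequencies versus the hypergeometric(p, s, s) pmf
for k in range(s + 1):
    print(k, round(float(np.mean(overlaps == k)), 4), round(hypergeom.pmf(k, p, s, s), 4))
```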