Minimax Signal Detection in Sparse Additive Models
Abstract
Sparse additive models are an attractive choice in circumstances calling for modelling flexibility in the face of high dimensionality. We study the signal detection problem and establish the minimax separation rate for the detection of a sparse additive signal. Our result is nonasymptotic and applicable to the general case where the univariate component functions belong to a generic reproducing kernel Hilbert space. Unlike the estimation theory, the minimax separation rate reveals a nontrivial interaction between sparsity and the choice of function space. We also investigate adaptation to sparsity and establish an adaptive testing rate for a generic function space; adaptation is possible in some spaces while others impose an unavoidable cost. Finally, adaptation to both sparsity and smoothness is studied in the setting of Sobolev space, and we correct some existing claims in the literature.
1 Introduction
In the interest of interpretability, computation, and circumventing the statistical curse of dimensionality plaguing high dimensional regression, structure is often assumed on the true regression function. Indeed, it might plausibly be argued that sparse linear regression is the distinguishing export of modern statistics. Despite its popularity, circumstances may call for more flexibility to capture nonlinear effects of the covariates. Striking a balance between flexibility and structure, Hastie and Tibshirani [20] proposed generalized additive models (GAMs) as a natural extension to the vaunted linear model. In a GAM, the regression function admits an additive decomposition of univariate (nonlinear) component functions. However, as in the linear model, the sample size must outpace the dimension for consistent estimation. Following modern statistical instinct, a sparse additive model is compelling [33, 38, 29, 50, 35, 39]. The regression function admits an additive decomposition of univariate functions for which only a small subset are nonzero; it is the combination of a GAM and sparsity.
To fix notation, consider the $d$-dimensional Gaussian white noise model

$$dY(t) = f(t)\,dt + \frac{1}{\sqrt{n}}\,dB(t) \qquad (1)$$

for $t \in [0,1]^d$. Though it may be more faithful to practical data analysis to consider the nonparametric regression model (e.g. as in [38]), the Gaussian white noise model is convenient as it avoids distracting technicalities while maintaining focus on the statistical essence. Indeed, the nonparametric statistics literature has a long history of studying the white noise model to understand theoretical limits, relying on well-known asymptotic equivalences [5, 40] which imply, under various conditions, that mathematical results obtained in one model can be ported over to the other model. As our focus is theoretical rather than practical, we follow in this tradition. The generalized additive model asserts the regression function is of the form $f(t) = \sum_{j=1}^{d} f_j(t_j)$ with the $f_j$ being univariate functions belonging to some function space. Likewise, the sparse additive model asserts

$$f(t) = \sum_{j \in S} f_j(t_j)$$

for some unknown support set $S \subseteq \{1, \ldots, d\}$ of size $|S| = s$ denoting the active covariates.
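To make the setting concrete, the short simulation below sketches a sparse additive signal: only $s$ of the $d$ covariates carry a nonzero univariate component. The cosine basis, the polynomial coefficient decay, the noise level, and all numerical values are illustrative stand-ins chosen for this sketch (they are not the paper's construction); the noisy observations mimic the nonparametric regression analogue of the white noise model.

```python
import numpy as np

rng = np.random.default_rng(0)

d, s, n_basis = 50, 3, 25          # ambient dimension, sparsity, basis truncation (illustrative)
support = rng.choice(d, size=s, replace=False)

# Each active component is a random smooth univariate function built from a
# cosine basis with polynomially decaying coefficients -- a stand-in for a
# function in (the unit ball of) a Sobolev-type RKHS.
def random_component(alpha=1.0):
    k = np.arange(1, n_basis + 1)
    coefs = rng.normal(size=n_basis) * k ** (-(alpha + 0.5))
    return lambda t: np.sqrt(2) * np.cos(np.pi * np.outer(t, k)) @ coefs

components = {j: random_component() for j in support}

def f(T):
    """Evaluate the sparse additive signal at the rows of T (shape (m, d))."""
    return sum(comp(T[:, j]) for j, comp in components.items())

# Noisy observations on a random design, the regression analogue of (1) with
# noise level corresponding to sample size n.
n = 500
T = rng.uniform(size=(n, d))
y = f(T) + rng.normal(size=n)
print("active covariates:", np.sort(support), " signal sd:", round(float(f(T).std()), 3))
```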
Most of the existing literature has addressed estimation of sparse additive models, primarily in the nonparametric regression setting and with being a reproducing kernel Hilbert space. After a series of works [33, 29, 35, 39], Raskutti et al. [38] (see also [29]) established that a penalized -estimator achieves the minimax estimation rate under various choices of the reproducing kernel Hilbert space . Yuan and Zhou [50] establish minimax estimation rates under a notion of approximate sparsity. As is now typical of estimation theory, the powerful framework of empirical processes is brought to bear in their proofs. Some articles have also addressed generalizations of the sparse additive model. For example, the authors of [49] consider, among other structures, an additive signal where each component function is actually a multivariate function depending on at most many coordinates and is -Hölder. The authors go on to derive minimax rates that handle heterogeneity in the smoothness indices and the sparsities of the coordinate functions; as a taste of their results, they show, in a particular regime and under some conditions, the rate in the special, homogeneous case where and for all . Recently, the results of [3] show certain deep neural networks can achieve the minimax estimation rate for sparse -way interaction models. The -way interaction model is also known as nonparametric ANOVA. To give an example, the sparse -way interaction model assumes where the sets of active variables and interactions have small cardinalities. When the are -Hölder and the are -Hölder, [3] establishes, under some conditions and up to factors logarithmic in , the rate .
The literature has much less to say on the problem of signal detection
(2) $H_0 \colon f = 0$
(3) $H_1 \colon f$ belongs to the class of sparse additive signals given by (4)
Adopting a minimax perspective [6, 23, 22, 21, 24], the goal is to determine the smallest separation rate as a function of the sparsity level $s$, the dimension $d$, the sample size $n$, and the function space such that consistent detection of the alternative against the null is possible.
Though to a much lesser extent than the corresponding estimation theory, optimal testing rates have been established in various high dimensional settings other than sparse additive models. The most canonical setup, the Gaussian sequence model, is perhaps the most studied [24, 2, 41, 34, 30, 8, 14, 10, 7, 12, 18]. Optimal rates have also been established in linear regression [1, 26, 37] and other models [36, 13]. A common motif is that optimal testing rates exhibit different phenomena from optimal estimation rates.
Returning to (2)-(3), the only directly relevant works in the literature are Ingster and Lepski’s article [25] and a later article by Gayraud and Ingster [17]. Ingster and Lepski [25] consider a sparse multichannel model which, after a transformation to sequence space, is closely related to (2)-(3). Adopting an asymptotic setting and exclusively choosing to be a Sobolev space, they establish asymptotic minimax separation rates. However, their results only address the regimes and for a constant . Their result does not precisely pin down the rate near the phase transition . In essence, their testing procedure in the sparse regime is to apply a Bonferroni correction to a collection of $\chi^2$-type tests, one test for each of the coordinates. Thus, a gap in their rate near is unsurprising. In the dense regime, a single $\chi^2$-type test is used, as is typical in the minimax testing literature. Ingster and Lepski [25] also address adaptation to sparsity as well as adaptation to both the sparsity and the smoothness. Gayraud and Ingster [17] consider sparse additive models rather than a sparse multichannel model but make the same choice of and work in an asymptotic setup. They establish sharp constants for the sparse case via a Higher Criticism type testing statistic. Throughout the paper, we primarily compare our results to [25] as it was the first article to establish rates. Our results do not address the question of sharp constants.
Our paper’s main contributions are the following. First, adopting a nonasymptotic minimax testing framework as initiated in [2], we establish the nonasymptotic minimax separation rate for (2)-(3) for any configuration of the sparsity , dimension , sample size , and a generic function space . Notably, we do not restrict ourselves to Sobolev (or Besov) space as in [25, 17]. The test procedure we analyze involves thresholded statistics, following a strategy employed in other sparse signal detection problems [10, 30, 8, 9, 34].
Our second contribution is to establish an adaptive testing rate for a generic function space. Typically, the sparsity level is unknown, and it is of practical interest to have a methodology which can accommodate a generic . Interestingly, some choices of do not involve any cost of adaptation, that is, the minimax rate can be achieved without knowing the sparsity. Our rate’s incurred adaptation cost turns out to be a delicate function of , thus extending Ingster and Lepski’s focus on Sobolev spaces [25]. Even in the Sobolev case, our result extends upon their article; near the regime , our test provides finer detail by incurring a cost involving only instead of as incurred by their test. In principle, our result can be used to reverse the process and find a space for which this adaptive rate incurs a given adaptation cost.
Finally, adaptation to both sparsity and smoothness is studied in the context of Sobolev space. We identify an issue with and correct a claim made by [25].
1.1 Notation
The following notation will be used throughout the paper. For a positive integer $m$, let $[m] = \{1, \ldots, m\}$. For $a, b \in \mathbb{R}$, denote $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. Denote $a \lesssim b$ to mean there exists a universal constant $C > 0$ such that $a \le C b$. The expression $a \gtrsim b$ means $b \lesssim a$. Further, $a \asymp b$ means $a \lesssim b$ and $b \lesssim a$. The symbol denotes the usual inner product in Euclidean space and denotes the Frobenius inner product. The total variation distance between two probability measures $P$ and $Q$ on a measurable space is defined as $d_{TV}(P, Q) = \sup_{A} |P(A) - Q(A)|$, where the supremum runs over measurable sets. If $P$ is absolutely continuous with respect to $Q$, the $\chi^2$-divergence is defined as $\chi^2(P \,\|\, Q) = \int \left(\frac{dP}{dQ}\right)^2 \, dQ - 1$. For a finite set $S$, the notation $|S|$ denotes the cardinality of $S$. Throughout, iterated logarithms will be used (e.g. expressions like and ). Without explicitly stating so, we will take such an expression to be equal to some universal constant if otherwise it would be less than one. For example, should be understood to be equal to a universal constant when .
1.2 Setup
1.2.1 Reproducing Kernel Hilbert Space (RKHS)
Following the literature (e.g. [47]), $\mathcal{H}$ will denote a reproducing kernel Hilbert space (RKHS). Before discussing the main results, basic properties of RKHSs are reviewed [47]. Suppose $\mathcal{H} \subset L^2[0,1]$ is an RKHS with associated inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$. There exists a symmetric function $K \colon [0,1] \times [0,1] \to \mathbb{R}$ called a kernel such that for any $x \in [0,1]$ we have (1) the function $K(\cdot, x) \in \mathcal{H}$ and (2) for all $f \in \mathcal{H}$ we have $f(x) = \langle f, K(\cdot, x) \rangle_{\mathcal{H}}$. Mercer’s theorem (Theorem 12.20 in [47]) guarantees that the kernel admits an expansion in terms of eigenfunctions $\{\phi_k\}_{k=1}^{\infty}$ and eigenvalues $\{\mu_k\}_{k=1}^{\infty}$, namely $K(x, y) = \sum_{k=1}^{\infty} \mu_k \phi_k(x) \phi_k(y)$. To give examples, the kernel $K(x, y) = \min(x, y)$ defines the first-order Sobolev space with eigenvalue decay $\mu_k \asymp k^{-2}$, and the Gaussian kernel exhibits near-exponential eigenvalue decay (see [47] for a more detailed review).
Without loss of generality, we order the eigenvalues $\mu_1 \ge \mu_2 \ge \cdots$. The eigenfunctions $\{\phi_k\}$ are orthonormal in $L^2[0,1]$ under the usual inner product, and the $\mathcal{H}$ inner product satisfies $\langle \phi_j, \phi_k \rangle_{\mathcal{H}} = \delta_{jk}/\mu_j$ for $j, k \ge 1$.
1.2.2 Parameter space
The parameter space which will be used throughout the paper is defined in this section. Suppose is an RKHS. Recall we are interested in sparse additive signals for some sparsity pattern . Following [17, 38], for convenience we assume for all . This assumption is mild and can be relaxed; further discussion can be found in Section 5.1. Letting denote the constant function equal to one, consider that is a closed subspace of . Hence, is also an RKHS. We will put aside (along with its eigenfunctions and eigenvalues) and only work with . Let and denote its associated eigenfunctions and eigenvalues respectively. Following [2], we assume also for convenience; this can easily be relaxed.
For each subset , define
The condition on the RKHS norm enforces regularity. Define the parameter space
(4) |
for each .
Following [25, 17], it is convenient to transform (1) from function space to sequence space. The tensor product admits the orthonormal basis with for . For ease of notation, denote for and . Define the random variables
(5) |
where . The assumption for all is used here. Note by orthogonality that is a collection of independent random variables. The notation will frequently be used to denote the full matrix of coefficients. The notation will also be used to denote the th column of . For , the corresponding set of coefficients is
(6) |
Note the parameter spaces and are in correspondence, and we will frequently write and freely in the same context without comment. The understanding is that the former is a function and the latter is its corresponding array of basis coefficients. The notation , and will be used to denote expectations and probability measures with respect to the denoted parameters.
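The sequence-space formulation just described can be sketched directly: the coefficients form a matrix whose columns correspond to covariates, only $s$ columns are nonzero, and every entry is observed with independent Gaussian noise of variance $1/n$. The eigenvalue sequence, the normalization of the columns, and the variable names below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

d, s, D, n = 200, 5, 40, 1000        # dimension, sparsity, basis truncation, sample size (illustrative)
mu = np.arange(1, D + 1) ** (-2.0)   # hypothetical eigenvalue sequence with Sobolev-like decay

# Coefficient matrix: column j holds the basis coefficients of the j-th
# component function; only s columns are nonzero.
theta = np.zeros((D, d))
support = rng.choice(d, size=s, replace=False)
for j in support:
    raw = rng.normal(size=D) * np.sqrt(mu)
    theta[:, j] = raw / np.sqrt(np.sum(raw ** 2 / mu))   # normalize the (hypothetical) RKHS norm to one

# Sequence-space observations: independent Gaussians with variance 1/n around theta.
x = theta + rng.normal(size=(D, d)) / np.sqrt(n)
print("nonzero columns:", np.sort(support))
```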
1.2.3 Problem
As described earlier, given an observation from (1) the signal detection problem (2)-(3) is of interest. The goal is to characterize the nonasymptotic minimax separation rate .
Definition 1.
The nonasymptotic minimax testing framework was initiated by Baraud [2] and characterizes (up to universal factors) the fundamental statistical limit of the detection problem. The framework is nonasymptotic in the sense that the conditions for the minimax separation rate hold for any configuration of the problem parameters.
Since , the problem can be equivalently formulated in sequence space as
(7) $H_0 \colon \theta = 0$
(8) $H_1 \colon \theta$ belongs to the corresponding class of coefficient arrays given by (6)
This testing problem, with the parameter space (6), is interesting in its own right outside the sparse additive model and RKHS context. Indeed, the testing problem (7)-(8) is essentially a group-sparse extension of the detection problem over ellipsoids considered in Baraud’s seminal article [2]. In fact, this interpretation was our initial motivation to study the detection problem. The connection to sparse additive models was a later consideration, similar to the way in which the later article [17] considers sparse additive models when building upon the earlier, fundamental work [25] dealing with a sparse multichannel (essentially group-sparse) model. Taking the perspective of a sequence problem has a long history in nonparametric regression [24, 25, 2, 15, 27, 48, 40, 5] due to not only its fundamental connections but also its advantage in distilling the problem to its essence and dispelling technical distractions. Our results can be exclusively read (and readers are encouraged to do so) in the context of the sequence problem (7)-(8).
2 Minimax rates
We begin by describing some high-level and natural intuition before informally stating our main result in Section 2.1. Section 2.2 contains the development of some key quantities. In Sections 2.3 and 2.4, we formally state minimax lower and upper bounds respectively. In Section 2.5, some special cases illustrating the general result are discussed.
2.1 A naive ansatz
A first instinct is to look to previous results for context in an attempt to make a conjecture regarding the optimal testing rate. To illustrate how this line of thinking might proceed, consider the classical Gaussian white noise model on the unit interval in one dimension,
for . Assume lives inside the unit ball of a reproducing kernel Hilbert space and thus admits a decomposition in the associated orthonormal basis with a coefficient vector living in an ellipsoid determined by the kernel’s eigenvalues . The optimal rate of estimation is given by Birgé and Massart [4]
Baraud [2] established that the minimax separation rate for the signal detection problem
is given by
In both estimation and testing, the maximizer can be conceptualized as the correct truncation order. Specifically for testing, Baraud’s procedure [2], working in sequence space, rejects when . The data for are not used, and it is in this sense that it is understood as a truncation level. To illustrate these results, the rates for Sobolev spaces with smoothness are and since . By now, these nonasymptotic results of [4, 2] are well known and canonical.
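For reference, here is a minimal sketch of a Baraud-style truncated chi-square test in the univariate sequence model. The truncation level, the chi-square quantile used as a threshold, and the simulation settings are illustrative choices rather than the calibrated quantities from [2].

```python
import numpy as np
from scipy.stats import chi2

def truncated_chi2_test(x, n, D, level=0.05):
    """Reject when the sum of the first D squared (rescaled) coefficients,
    which is chi-square with D degrees of freedom under the null, is large.
    Coefficients beyond the truncation level D are simply not used."""
    stat = n * np.sum(x[:D] ** 2)
    return stat > chi2.ppf(1 - level, df=D)

rng = np.random.default_rng(2)
n, D, K = 500, 20, 200
null_rej = np.mean([truncated_chi2_test(rng.normal(size=K) / np.sqrt(n), n, D)
                    for _ in range(200)])
theta = np.zeros(K); theta[:D] = 0.05            # a weak signal spread over the first D coefficients
alt_rej = np.mean([truncated_chi2_test(theta + rng.normal(size=K) / np.sqrt(n), n, D)
                   for _ in range(200)])
print("null rejection rate:", null_rej, " alternative rejection rate:", alt_rej)
```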
Moving to the setting of sparse additive signals in the model (1), Raskutti et al. [38] derive the nonasymptotic minimax rate of estimation
While their upper bound holds for any choice of (satisfying some mild conditions) and sparsity level , they only obtain a matching lower bound when and when the unit ball of has logarithmically or polynomially scaling metric entropy. This rate obtained by Raskutti et al. [38] is pleasing. It is quite intuitive to see the term due to the sparsity in the parameter space. The term is the natural rate for estimating many univariate functions in as if the sparsity pattern were known. Notably, there is no interaction between the choice of and the sparsity structure. The sparsity term is independent of and the estimation term is dimension free.
One might intuit that this lack of interaction is a general phenomenon. Instinct may suggest that signal detection in sparse additive models is also just many instances of a univariate nonparametric detection problem plus the problem of detecting a signal which is nonzero on an unknown sparsity pattern. Framing it as two distinct problems, one might conjecture the optimal testing rate should be plus the -sparse detection rate. Collier et al. [10] provide a natural candidate for the -sparse testing rate, namely the rate for and for .
However, this is not the case as a quick glance at [25] falsifies the conjecture for the case of Sobolev . Though quick to dispel hopeful thinking, [25] expresses little about the interaction between sparsity and . Our result explicitly captures this nontrivial interaction for a generic . We show the minimax separation rate is given by
(9) |
The rate bears some resemblance to the sparse testing rate of Collier et al. [10] and the nonparametric testing rate of Baraud [2], but the combination of the two is not a priori straightforward. At this point in discussion, not enough has been developed to explain how the form of the rate arises. Various features of the rate will be commented on later on in the paper.
2.2 Preliminaries
In this section, some key pieces are defined.
Definition 2.
Define to be the quantity
Note, since , it follows . It is readily seen from (9) there are two broad regimes to consider. When , we have . In the regime , the rate is more complicated. The first regime is really a trivial regime since any signal must satisfy by virtue of . Therefore, the degenerate test which always accepts vacuously detects sparse additive signals with . Hence, the upper bound is trivially achieved. It turns out a matching lower bound can be proved which establishes that the regime is fundamentally trivial; see Section 2.3 for a formal statement.
More generally, the form (9) is useful when discussing lower bounds. A different form is more convenient when discussing upper bounds, and it is a form which is familiar in the context of [10, 11].
Definition 3.
Define to be the smallest positive integer such that
As the next lemma shows, is essentially the solution to the maximization problem in the definition of . Drawing an analogy to the result of Baraud [2] described in Section 2.1, can be conceptualized as the correct order of truncation accounting for the dimension and sparsity.
Lemma 1.
If , then .
With Lemma 1, the testing rate can be expressed as
(10) |
The condition in Lemma 1 is natural in light of the triviality which occurs when .
2.3 Lower bound
In this section, a formal statement of a lower bound on the minimax separation rate is given. Define
(11) |
First, it is shown that the testing problem is trivial if .
Proposition 1 (Triviality).
Suppose . Suppose and . If , then for any we have
Proposition 1 asserts that in order to achieve small testing error, a necessary condition is that for a sufficiently large . To see why Proposition 1 is a statement about triviality, observe that . To reiterate plainly, all potential in the model (1) live inside the ball of radius . There are essentially no functions that have detectable norm when , and so the problem is essentially trivial.
The lower bound construction is straightforward. Working in sequence space, consider an alternative hypothesis with a prior in which a draw is obtained by drawing a size subset uniformly at random and setting if and or setting otherwise. The value of determines the separation between the null and alternative hypotheses since . However, must respect some constraints to ensure is supported on the parameter space and that it is impossible to distinguish the null and alternative hypotheses with vanishing error. Observe this construction places us in a one-dimensional sparse sequence model (since for ), which is precisely the setting of [10]. From their results, it is seen we must have . To ensure is supported on the parameter space, we must have for all . Since , it follows , and so the constraint must be enforced. When , only the second condition is binding, and so the largest separation we can achieve is . Hence, the problem is trivial in this regime.
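For concreteness, the sketch below draws once from a prior of the kind just described: a uniformly random support of size $s$ is chosen and the active columns receive a common coefficient magnitude with random signs on the first few basis coefficients. The spike magnitude, the Rademacher signs, and the truncation level are simplifying assumptions for illustration, not the exact prior used in the proofs.

```python
import numpy as np

def draw_from_prior(d, s, D, rho, rng):
    """One draw from a simplified version of the lower-bound prior: pick a
    uniformly random support of size s and fill the first D basis coefficients
    of each active column with +/- rho (random signs)."""
    theta = np.zeros((D, d))
    support = rng.choice(d, size=s, replace=False)
    theta[:, support] = rho * rng.choice([-1.0, 1.0], size=(D, s))
    return theta, support

rng = np.random.default_rng(3)
theta, support = draw_from_prior(d=100, s=4, D=10, rho=0.05, rng=rng)
# Every draw has the same squared norm s * D * rho^2, which sets the separation.
print("support:", np.sort(support), " squared norm:", round(float(np.sum(theta ** 2)), 4))
```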
To ensure non-triviality, it will be assumed . The choice of the factor is only for convenience and is not essential. In fact, the condition would always suffice for our purposes, and the condition would also suffice for . The following theorem establishes .
Theorem 1.
Suppose and . If , then there exists depending only on such that for any we have
where is given by (11).
The lower bound is proved via Le Cam’s two point method (also known as the method of “fuzzy” hypotheses [45]) which is standard in the minimax hypothesis testing literature [24]. The prior distribution employed in the argument is a natural construction. Namely, the active set (i.e. the size set of nonzero coordinate functions) is selected uniformly at random and the corresponding nonzero coordinate functions are drawn from the usual univariate nonparametric prior employed in the literature [2, 24].
2.4 Upper bound
In this section, testing procedures are constructed and formal statements establishing rate-optimality are made. The form of the rate in (10) is a more convenient target, and should be kept in mind when reading the statements of our upper bounds.
2.4.1 Hard thresholding in the sparse regime
In this section, the sparse regime is discussed. For any and , define where the data is defined via transformation to sequence space (5). For any , define
(12) |
where
(13) |
where . Note is a conditional expectation under the null hypothesis. The random variable will be used as a test statistic. Such a statistic was first defined by Collier et al. [10], and similar statistics have been successfully employed in other signal detection problems [9, 34, 30, 8]. However, all previous works in the literature have only used this statistic with . For our problem, it is necessary to take growing . Consequently, a more refined analysis is necessary, and essentially new phenomena appear in the upper bound as we discuss later. As noted in Section 2.2, the quantity can be conceptualized as the correct truncation order, that is, it turns out the correct choice is . The choice of is more complicated, and in fact there are two separate regimes depending on the size of . The regime in which is referred to as the “bulk” regime. The complementary regime is referred to as the “tail” regime.
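The following sketch shows the shape of such a thresholded statistic: a chi-square-type statistic is formed from the first few basis coefficients of each coordinate, hard-thresholded, and then aggregated over coordinates. The centering, the threshold, and all constants below are illustrative; in particular the conditional-expectation correction (13) used in the paper is omitted.

```python
import numpy as np

def thresholded_stat(x, n, D, t):
    """Hard-thresholded chi-square-type statistic (simplified sketch).

    x : (D_max, d) array of sequence-space observations.
    For each coordinate j, form Y_j = n * sum_{k <= D} x_{jk}^2, which is
    chi-square with D degrees of freedom under the null; keep its excess over D
    only when Y_j exceeds D + t, and sum the kept excesses over j."""
    Y = n * np.sum(x[:D, :] ** 2, axis=0)
    return np.sum(np.where(Y > D + t, Y - D, 0.0))

rng = np.random.default_rng(4)
d, D, n = 500, 15, 400
x_null = rng.normal(size=(D, d)) / np.sqrt(n)
x_alt = x_null.copy()
x_alt[:, :5] += 0.12                       # five active coordinates with a flat bump
t = 4 * np.sqrt(D * np.log(d))             # an illustrative "bulk"-type threshold
print("null statistic:", round(float(thresholded_stat(x_null, n, D, t)), 2),
      " alternative statistic:", round(float(thresholded_stat(x_alt, n, D, t)), 2))
```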
Proposition 2 (Bulk).
Proposition 3 (Tail).
Set where is the universal constant from Lemma 11. Let denote the universal positive constant from Proposition 2. There exist universal positive constants and such that the following holds. Suppose and . Set
If , then there exists depending only on such that for all , we have
where . Here, is given by (12).
Propositions 3 and 2 thus imply that, for , the minimax separation rate is upper bounded by . By Lemma 1, it follows that
under the condition that . As established by Proposition 1 in Section 2.3, a condition like this is essential to avoid triviality.
Propositions 3 and 2 reveal an interesting phase transition in the minimax separation rate at the point
This phase transition phenomenon is driven by the tail behavior of . Consider that under the null distribution for all , and so the statistic is the sum of independent and thresholded $\chi^2$ random variables. By a well-known lemma of Laurent and Massart [31] (also from Bernstein’s inequality up to constants), for a $\chi^2$ random variable with $D$ degrees of freedom and any $t > 0$ we have
$$\mathbb{P}\left( \chi^2_D \ge D + 2\sqrt{Dt} + 2t \right) \le e^{-t}. \qquad (14)$$
Roughly speaking, a $\chi^2_D$ random variable exhibits subgaussian-type deviation in the “bulk” ($t \lesssim D$) and subexponential-type deviation in the “tail” ($t \gtrsim D$). Consequently, should be intuited as a sum of thresholded subgaussian random variables when , and as a sum of thresholded subexponential random variables when .
Examining (14), in the “tail” it follows , which no longer exhibits dependence on . Analogously, in Proposition 3 the threshold is taken as and so the resulting rate exhibits no dependence on and consequently no dependence on . On the other hand, in the “bulk” it follows . Analogously, in Proposition 2 the threshold is taken as and so the resulting rate indeed depends on and thus on .
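A quick Monte Carlo check of the tail bound (14), stated above in the standard Laurent–Massart form, illustrates the two regimes: the deviation is governed by the $2\sqrt{Dt}$ term in the bulk and by the $2t$ term in the tail. The degrees of freedom and the values of $t$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
D, reps = 30, 200_000
chi2_samples = rng.chisquare(df=D, size=reps)

# Empirical exceedance probabilities versus the bound exp(-t) from (14).
for t in [1.0, 5.0, 30.0, 90.0]:           # spans the "bulk" (t <~ D) and the "tail" (t >~ D)
    thresh = D + 2 * np.sqrt(D * t) + 2 * t
    emp = np.mean(chi2_samples >= thresh)
    print(f"t = {t:5.1f}   empirical = {emp:.2e}   bound exp(-t) = {np.exp(-t):.2e}")
```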
2.4.2 $\chi^2$ tests in the dense regime
The situation is less complicated in the dense regime as it suffices to use the testing statistic .
Proposition 4.
Suppose . Set
If , then there exists depending only on such that for all , we have
2.5 Special cases
Having formally stated lower and upper bounds for the minimax separation rate, it is informative to explore a number of special cases and witness a variety of phenomena. Throughout the illustrations it will be assumed as discussed earlier.
2.5.1 Sobolev
Taking the case as emblematic of Sobolev space with smoothness , we obtain the minimax separation rate
It is useful to compare to the rates obtained by Ingster and Lepski (Theorem 2 in [25]) (see also [17]) although their choice of parameter space is slightly different from that considered in this paper. In the sparse case for a constant , their rate is
In the dense regime , their rate (Theorem 1 in [25]) is . Quick algebra verifies that our rate indeed matches Ingster and Lepski’s rate in these sparsity regimes.
In the sparse regime, the strange looking phase transition at in their rate now has a coherent explanation in view of our result. The situation corresponds to the “bulk” regime in which case from (12) behaves like a sum of thresholded subgaussian random variables. On the other side where , subexponential behavior is exhibited by . In fact, our result gives more detail beyond . Assume only (for example, is allowed now). Then the phase transition between the “bulk” and “tail” regimes actually occurs at .
2.5.2 Finite dimension
Consider a finite dimensional situation, that is for some positive integer . Function spaces exhibiting this kind of structure include linear functions, finite polynomials, and generally RKHSs based on kernels with finite rank. If , the minimax separation rate is
In the sparse regime, the phase transition between the bulk and tail regimes occurs at .
2.5.3 Exponential decay
As another example, consider exponential decay of the eigenvalues where are universal constants and . Such decay is a feature of RKHSs based on Gaussian kernels. The minimax separation rate is
In the sparse regime, the phase transition between the bulk and tail regimes occurs at . The minimax separation rate is quite close to the finite dimensional rate, which is sensible as RKHSs based on Gaussian kernels are known to be fairly “small” nonparametric function spaces [47].
3 Adaptation
Thus far the sparsity parameter has been assumed known, and the tests constructed make use of this information. In practice, the statistician is typically ignorant of the sparsity level and so it is of interest to understand whether adaptive tests can be furnished. In this section, we will establish an adaptive testing rate which accommodates a generic , and it turns out to exhibit a cost for adaptation which depends delicately on the function space.
To the best of our knowledge, Spokoiny’s article was the first to demonstrate an unavoidable cost for adaptation in a signal detection problem [41]. Later work established unavoidable costs for adaptive testing across a variety of situations (see [24] and references therein). This early work largely focused on adapting to an unknown smoothness parameter in a univariate nonparametric regression setting. More recently adaptation to sparsity in high dimensional models has been studied. In many problems, one can adapt to sparsity without cost in the rate (nor in the constant for some problems) [14, 34, 30, 26, 1, 18]. In the context of sparse additive models in Sobolev space, Ingster and Lepski [25] (see also [17]) consider adaptation, and we discuss their results in Section 3.4.
3.1 Preliminaries
To formally state our result, some slight generalizations of the concepts found in Section 2.2 are needed.
Definition 4.
For , define to be the smallest positive integer satisfying
Definition 5.
For , define to be
Note is decreasing in and is increasing in . As discussed in Section 2.2, two different forms are useful when discussing the separation rate, and Lemma 1 facilitated that discussion. The following lemma is a slight generalization and has the same purpose.
Lemma 2.
Suppose . If , then
At a high level, the central issue with adaptation lies in the selection of an estimator’s or a test’s hyperparameter (such as the bandwidth or penalty). Typically, there is some optimal choice but it requires knowledge of an unknown parameter (e.g. smoothness or sparsity) in order to pick it. In the current problem, the optimal choice of is unknown since the sparsity level is unknown.
In adaptive testing, the typical strategy is to fix a grid of different values, construct a test for each potential value of the hyperparameter, and for detection use the test which takes the maximum over this collection of tests. Typically, a geometric grid is chosen, and the logarithm of the grid’s cardinality reflects a cost paid by the testing procedure [41, 34, 25]. It turns out this high level intuition holds for signal detection in sparse additive models, but the details of how to select the grid are not direct since a generic must be accommodated. For , define
Define
(15) |
and define
(16) |
The grid will be used. It is readily seen that is finite as the crude bound is immediate. The following lemma shows that is, in essence, a fixed point.
Lemma 3.
If , then .
It turns out an adaptive testing procedure can be constructed which pays a price determined by (equivalently up to constant factors). To elaborate, assume . We will construct an adaptive test which achieves
The form of this separation rate is very similar to the minimax rate, except that the phase transition has shifted to and a cost involving is incurred. The value of can vary quite a bit with the choice of , and a few examples are illustrated in Section 3.4.
Our choice of the geometric grid yielding the fixed point characterization of Lemma 3 may not be immediately intuitive. It can be understood at a high level by noticing there are two competing factors at play. First, fix and note it can be conceptualized to represent a target cost from scanning over some grid. To achieve that target cost, we should use the truncation levels for each sparsity . Now, consider that we will scan over the geometric grid obtained from the truncation levels. If that geometric grid has cardinality which would incur a cost less than , then it would appear strange that we are scanning over a smaller grid to achieve a target cost associated to a larger grid. It is intuitive that we might try to aim for a lower target cost instead. But this changes our grid to which now has a different cardinality, and so the same logic applies to . A similar line of reasoning applies if the grid had happened to be too large. Therefore, we intuitively want to select such that has the proper cardinality to push the cost upon us. Hence, we are seeking a kind of fixed point in this sense.
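The fixed-point idea can be mimicked numerically. In the sketch below, `trunc_level` is a purely hypothetical surrogate for the truncation level (the only properties used are that it increases in the target cost and decreases in the sparsity, so it is not the paper's definition), and the iteration repeatedly replaces the target cost with the log-cardinality of the geometric grid it induces until the two agree, in the spirit of Lemma 3.

```python
import numpy as np

def trunc_level(s, u, n=1000, d=500, alpha=1.0):
    """Hypothetical surrogate for the truncation level at sparsity s and target
    cost u: increasing in u and (roughly) decreasing in s, which is all the
    fixed-point iteration below relies on."""
    val = (n * np.sqrt(u + s * np.log(np.e * d / s)) / s) ** (1.0 / (2 * alpha + 0.5))
    return max(1, int(np.ceil(val)))

def fixed_point_cost(n=1000, d=500, alpha=1.0, max_iter=100):
    """Iterate u <- log(cardinality of the induced geometric grid) until stable."""
    u = 1.0
    for _ in range(max_iter):
        levels = {trunc_level(s, u, n, d, alpha) for s in range(1, d + 1)}
        grid = {2 ** int(np.ceil(np.log2(D))) for D in levels}   # geometric grid over truncation levels
        u_new = max(1.0, np.log(len(grid)))
        if abs(u_new - u) < 1e-9:
            break
        u = u_new
    return u

print("fixed-point adaptation cost (illustrative):", round(fixed_point_cost(), 3))
```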
3.2 Lower bound
The formulation of a lower bound is more delicate than one might initially anticipate. To illustrate the subtlety, suppose we have a candidate adaptive rate . It is tempting to consider the following lower bound formulation; for any , there exists such that for any ,
While extremely natural, this criterion has an issue. In particular, if there exists such that where denotes the minimax rate, then the above criterion is trivially satisfied. One simply lower bounds the maximum over all with the choice and then invokes the minimax lower bound. To exaggerate, the candidate rate and for satisfies the above criterion. Furthermore, from an upper bound perspective there is a trivial test which achieves this candidate rate. The absurdity from this lower bound criterion would then force us to conclude this candidate rate gives an adaptive testing rate.
To avoid such absurdities, we use a different lower bound criterion. The key issue in the formulation is that, in the domain of maximization for the Type II error, one must not include any for which the candidate rate is of the same order as the minimax rate.
Definition 6.
Suppose is a candidate adaptive rate and are a collection of sets satisfying for all and . We say satisfies the adaptive lower bound criterion with respect to if the following two conditions hold.
-
(i)
For any , there exists depending only on such that for all ,
for all and .
-
(ii)
Either
(17) or
(18) where denotes the minimax separation rate.
Intuitively, this criterion is nontrivial only when there are no sparsity levels in the chosen reference sets for which the candidate rate is the same order as the minimax rate. More explicitly, there is no sequence with along which the candidate and minimax rates match (up to constants). This criterion avoids trivialities such as the one described earlier. In Definition 6, formalizing the notion of “order” requires speaking in terms of sequences, though it may appear unfamiliar and clunky at first glance. Definition 6 is not new, but rather a direct port to the testing context of the notion of an adaptive rate of convergence on the scale of classes from [44] in the estimation setting (see also [11] for an application of this notion to linear functional estimation in the sparse Gaussian sequence model).
The condition (18) simply requires the candidate rate to match the minimax rate if (17) does not hold. In this way, the definition allows for the possibility of the minimax rate to be an adaptive rate (in those cases where no cost for adaptation needs to be paid).
We now state a lower bound with respect to this criterion. Define the candidate rate
(19) |
Here, is defined in (15). Examining this candidate, it is possible one may run into absurdities for since , meaning the candidate rate can match the minimax rate. Note here we have used that grows at most logarithmically in . Therefore, we will take (dropping the subscripts and writing for notational ease).
It follows from the fact that is increasing in that either (17) or (18) should typically hold. To see that is indeed increasing in , let us fix . Consider that for every we have
Taking maximum over on both sides yields , i.e. is increasing in . Now, consider and in order to compare the candidate rate to the minimax rate . Since , we have and . With the above display, it is intuitive that either (17) or (18) should hold in typical nonpathological cases; these conditions can be easily checked on a problem by problem basis (such as those in Section 3.4).
In preparation for the statement of the lower bound, the following set is needed
(20) |
With defined, we are ready to state the lower bound. The following result establishes condition (i) of Definition 6 for given by (19) with respect to our choice .
Theorem 2.
Suppose . Assume further . If , then there exists depending only on such that for all , we have
where is given by (19) and .
The condition is somewhat mild; for example it is satisfied when the eigenvalues satisfy a polynomial decay such as for some and positive universal constant . It is also satisfied when there is exponential decay such as for some and positive universal constants and . Section 3.4 contains further discussion of these two cases.
The proof strategy is, at a high level, the same as that in Section 2.3 with the added complication that the prior should also randomize over the sparsity level in order to capture the added difficulty of not knowing the sparsity level. The random sparsity level is drawn by drawing a random truncation point and extracting the induced sparsity. The rest of the prior construction is similar to that in Section 2.3, but the analysis is complicated by the random sparsity level.
Finally, note Theorem 2 is a statement about a cost paid for adaptation for sparsity levels in . Nothing is asserted about smaller sparsities. It would be interesting to show whether or not the candidate also captures the cost for sparsity levels which satisfy for all (e.g. ); we leave it open for future work.
3.3 Upper bound
In this section, an adaptive test is constructed. Recall and given in (15) and (16) respectively. Define
(21) |
In the test, the following collection of geometric grids is used. Define
(22) |
The choice to take as a geometric grid is not statistically critical, but doing so is computationally advantageous.
Theorem 3.
There exist universal positive constants and such that the following holds. Let , , and be given by (15), (16), and (22) respectively. For and , set and set
If , then there exists depending only on such that for all , we have
where is given by (21). Here, is given by
and the statistic is given by (12).
As mentioned earlier, the adaptive test involves scanning over the potential values of in the geometric grid . Consequently, a cost involving is paid. Note for we also need to scan over in order to set the thresholds in the statistics properly. It turns out this extra scan does not incur a cost; cost is driven by needing to scan over .
To understand how scanning over can incur a cost, it is most intuitive to consider the need to control the Type I error when scanning over the statistics with various degrees of freedom. Roughly speaking, it is necessary to pick the threshold values such that the Type I error is smaller than some prescribed error level. By union bound and (14), consider the tail bound . To ensure the Type I error is small, this tail bound suggests the choice where the latter equivalence follows from the fact grows at most logarithmically in . Since the threshold is inflated, a signal must also have larger norm in order to be detected. The logarithmic inflation factor is a consequence of the tail.
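A schematic version of such a scan is sketched below: one chi-square-type statistic per truncation level in a geometric grid, each compared against a threshold inflated by the square root of the log-cardinality of the grid, as suggested by the union bound. For simplicity the sketch uses plain (unthresholded) aggregated statistics and illustrative constants rather than the calibrated statistics and thresholds of Theorem 3.

```python
import numpy as np

def adaptive_scan_test(x, n, grid, c=3.0):
    """Scan a geometric grid of truncation levels; reject if any level's
    aggregated centered chi-square statistic exceeds its inflated threshold."""
    d = x.shape[1]
    log_card = np.log(max(len(grid), 2))
    for D in grid:
        Y = n * np.sum(x[:D, :] ** 2, axis=0) - D        # centered per-coordinate statistics
        thresh = c * np.sqrt(d * D) * np.sqrt(log_card)  # sqrt(log|grid|) inflation from the union bound
        if np.sum(Y) > thresh:
            return True
    return False

rng = np.random.default_rng(6)
d, Dmax, n = 300, 64, 500
grid = [2 ** j for j in range(1, 7)]                     # geometric grid 2, 4, ..., 64
x_null = rng.normal(size=(Dmax, d)) / np.sqrt(n)
x_alt = x_null.copy()
x_alt[:8, :40] += 0.06                                   # a dense-ish alternative
print("reject under null:", adaptive_scan_test(x_null, n, grid),
      " reject under alternative:", adaptive_scan_test(x_alt, n, grid))
```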
3.4 Special cases
To illustrate how a price for adaptation depends delicately on the choice of function space , we consider a variety of cases. Throughout, the assumption as well as the other assumptions necessary for our result are made.
3.4.1 Sobolev
Taking as emblematic of Sobolev space with smoothness , it can be shown that (recall (15) and (20)). Our upper and lower bounds assert an adaptive testing rate is given by
Ingster and Lepski [25] consider adaptation in the sparse regime and the dense regime for separately. In the sparse regime, they construct an adaptive test which is able to achieve the minimax rate; no cost is paid. In the dense regime, not only do they give a test which achieves , they also supply a lower bound (now, in our lower bound notation, can be taken) showing it cannot be improved. As seen in the above display, our test enjoys optimality in these regimes.
The reader should take care when comparing [25] to our result. In our setting, the unknown sparsity level can vary throughout the entire range . In contrast, [25] consider two separate cases. Either the unknown sparsity is constrained to , or it is constrained to . To elaborate, the precise value of is unknown but [25] assumes the statistician has knowledge of which of the two cases is in force. In contrast, we make no such assumption; nothing at all is known about the sparsity level. In our view, this situation is more faithful to the problem facing the practitioner. Despite being constructed for a seemingly more difficult setting, our test turns out to be optimal in Ingster and Lepski’s setting.
Our result provides finer detail around missed by [25]. In Theorem 3 of their article, Ingster and Lepski propose an adaptive test procedure which is applicable in the regime . It requires signal strength of squared order at least
This is suboptimal since our test achieves
which is faster.
3.4.2 Finite dimension
Consider a finite dimensional structure where is a positive integer. If , it can be shown , and so the minimax separation rate can be achieved by our adaptive test. In other words, no cost is paid for adaptation.
3.4.3 Exponential decay
Consider exponential decay of the eigenvalues where . It can be shown that , and so an adaptive rate is
The cost for adaptation here grows very slowly, and is perhaps another indication of the relatively “small” size of RKHSs based on Gaussian kernels (as noted in Section 2.5).
4 Adaptation to both sparsity and smoothness
So far we have assumed that the space is known. However, it is likely that is one constituent out of a collection of spaces indexed by some hyperparameter, such as a smoothness level. Typically, the true value of this hyperparameter is unknown in addition to the sparsity being unknown, and it is of interest to understand how the testing rate changes. To avoid excessive abstractness, we follow [25, 17] and adopt a working model where the eigenvalues have the form emblematic of Sobolev space with smoothness . To emphasize the dependence on , let us write and its corresponding space . Reiterating, we are interested in adapting to both the sparsity level and the smoothness level .
Ingster and Lepski [25] study adaptation to both sparsity and smoothness. In particular, they study adaptation for an unknown in a closed interval where the endpoints are known. As argued in [25], since is known, the sequence space basis in Section 1.2.2 for any can be taken to be -regular, and thus is known to the statistician. Furthermore, they separately study the dense regime and the sparse regime for some constant . We adopt the same setup as them.
Ingster and Lepski [25] claim the following adaptive rate. In the dense case , they make the assumption and state (Theorem 4 in [25])
In contrast (the same error affects the proofs of the lower bounds in Theorems 4 and 5 in [25]: the prior they define is not supported on the parameter space, and under their prior (see (75) in [25]) the various coordinate functions end up having different smoothness levels instead of sharing a single one), we will show the adaptive rate in the dense case is
The careful reader will note that in the special case , our answer recovers the adaptive separation rate proved by Spokoiny [41]. In the sparse case , their claimed rate (Theorem 5 in [25]) is
(23) |
As a partial correction, we prove the following lower bound
(24) |
In the regime , (23) and (24) match each other and match the minimax rate (see Section 2.5.1). However, in the case , there may be a difference between (23) and (24).
4.1 Dense
Fix and . Recall in the dense regime that . Define
(25) |
As mentioned earlier, we will be establishing this rate in the dense regime. For use in a testing procedure, define the geometric grid
(26) |
Note the statistician does not need to know nor to construct . Further note .
4.1.1 Upper bound
The test procedure employed follows the typical strategy in adaptive testing, namely constructing individual tests for potential values of in the grid , and then taking the maximum over the individual tests to perform signal detection. The cost of adaptation involves the logarithm of the cardinality of (i.e. ). Since the dense regime is in force, the individual tests are $\chi^2$ tests.
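The sketch below builds an illustrative grid over the smoothness interval so that the induced truncation levels form (roughly) a geometric progression, which is the mechanism by which a manageable number of individual tests suffices. The spacing rule and the classical testing truncation exponent $n^{2/(4\alpha+1)}$ are assumptions used only for this illustration, not the paper's exact construction.

```python
import numpy as np

def smoothness_grid(alpha_min, alpha_max, n):
    """Illustrative grid over [alpha_min, alpha_max] whose induced truncation
    levels n^{2/(4a+1)} differ by (roughly) a constant factor between
    consecutive grid points."""
    alphas, a = [], alpha_min
    while a <= alpha_max:
        alphas.append(a)
        a += (4 * a + 1) ** 2 / (8 * np.log(n))   # spacing making the log-truncation drop by ~1 per step
    return alphas

n = 10_000
alphas = smoothness_grid(0.5, 3.0, n)
trunc = [int(np.ceil(n ** (2 / (4 * a + 1)))) for a in alphas]
print("grid size:", len(alphas))
print("induced truncation levels:", trunc)
print("adaptation cost ~ log(grid size):", round(np.log(len(alphas)), 2))
```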
4.1.2 Lower bound
The following theorem states the lower bound. Note that it satisfies Definition 6 with the straightforward modification to incorporate adaptation over . In particular, the potential difficulty with absurdities outlined in Section 3.2 does not arise since the candidate rate
is never of the same order as the minimax rate .
Theorem 5.
Fix and . If , then there exists depending only on such that for all , we have
where is given by (25).
4.2 Sparse
Fix and . Recall that in the sparse regime we have . Define
(27) |
Ingster and Lepski [25] give a testing procedure achieving (23) and thus establish the upper bound. As noted earlier, when , not only do (27) and (23) match each other but they also match the minimax rate (see Section 2.5.1). In the regime , it is not clear from a lower bound perspective whether a cost for adaptation is unavoidable. We complement their upper bound by providing a lower bound which demonstrates that it is indeed necessary to pay a cost for adaptation. However, the cost we identify may not be sharp. To elaborate, in the case where is a large universal constant but much smaller than , the sparse regime rate (27) involves . From Spokoiny’s article [41], is instead expected. We leave it for future work to pin down the sharp cost.
Theorem 6.
Fix and . If , then there exists depending only on such that for all , we have
where is given by (27).
5 Discussion
5.1 Relaxing the centered assumption
In Section 1.2.2, in the definition of the parameter space (4) the signal was constrained to be centered. It was claimed this constraint is mild and can be relaxed; this section elaborates on this point.
Suppose is an additive function (possibly uncentered) with each . Let and note we can write where is the constant function equal to one. Since is an additive function, for any we have where with . In other words, we have where itself is an additive function with each .
Asserting is a sparse additive function implies is a (centered) sparse additive function. Following [38, 17], we impose constraints (i.e. RKHS norm constraints) on the centered component functions . Therefore, the following uncentered signal detection problem can be considered,
(28)
(29) |
where is given by (4). Let denote the minimax separation rate. Consider by orthogonality. Thus, there are two related detection problems, Problem I
and Problem II
Let and denote the minimax separation rates of the respective problems. We claim . From it is directly seen that , and so the lower bound is proved. To show the upper bound, consider that if , then by triangle inequality it follows that either or . Thus, there is enough signal to detect and the test which takes maximum over the subproblem minimax tests is optimal. Hence, the upper bound is now also proved.
All that remains is to determine the minimax rates and . Let us index starting from zero and take and to be the associated eigenfunctions and eigenvalues of the RKHS . Recall forms an orthonormal basis for under the usual inner product. Since by definition is orthogonal to the span of the constant function in , without loss of generality we can take to be the constant function equal to one, , and to be the remaining eigenfunctions orthogonal (in ) to constant functions.
The data (defined in (5)) is sufficient for Problem II. Therefore, the minimax rate is exactly given by our result established in this paper. The minimax rate can be upper bounded in the following manner. Consider that since . Therefore, the parametric rate can be achieved. Hence, . In other words, centering is immaterial to the fundamental limits of the signal detection problem.
5.2 Sharp constant
As we noted earlier, our results do not say anything about the sharp constant in the detection boundary. The problem of obtaining a sharp characterization of the constants in the detection boundary is interesting and likely delicate. In an asymptotic setup and in the Sobolev case, Gayraud and Ingster [17] were able to derive the sharp constant in the sparse regime for fixed under the condition . Gayraud and Ingster discuss that this condition is actually essential, in that the detection boundary no longer exhibits a dependence on when the condition is violated. This condition has a nice formulation in our notation, namely it is equivalent to the condition
This correspondence in the Sobolev case suggests this condition may actually be essential for a generic . It would be interesting to understand whether that is true, and to derive the sharp constant when it does hold.
To be clear, it may be the case (perhaps likely) that our proposed procedure is suboptimal in terms of the constant. Indeed, the existing literature on sparse signal detection, both in the sparse sequence model [14, 18] and in sparse linear regression [26, 1], relies on Higher Criticism type tests to achieve the optimal constant. Gayraud and Ingster [17] themselves use Higher Criticism. For a generic space , our procedure should not be understood as the only test which is rate-optimal. In the sparsity regime , we suspect an analogous Higher Criticism type statistic which accounts for the eigenstructure of the kernel might not only achieve the optimal rate, but also the sharp constant.
5.3 Future directions
There are a number of avenues for future work. First, we only considered a single space in which all component functions live. In some scenarios, it may be desirable to consider a different space for each component. Raskutti et al. [38] obtained the minimax estimation rate when considering multiple kernels (under some conditions on the kernels). We imagine our broad approach in this work could be extended to determine the minimax separation rate allowing multiple kernels. Instead of a common , it is likely different quantities will be needed per coordinate and the test statistics could be modified in the natural way. The theory developed here could be used directly, and it seems plausible the minimax separation rate could be established in a straightforward manner.
Another avenue of research involves considering “soft” sparsity in the form of constraints for . Yuan and Zhou [50] developed minimax estimation rates for RKHSs exhibiting polynomial decay of their eigenvalues (e.g. Sobolev space). In terms of signal detection, it is plausible that the quadratic functional estimator under the Gaussian sequence model with bound on the mean could be extended and used as a test statistic [10]. The hard sparsity and soft sparsity settings studied in [10] are handled quite similarly. It is possible not much additional theory needs to be developed in order to obtain minimax separation rates under soft sparsity.
Since hypothesis testing is quite closely related to functional estimation in many problems, it is natural to ask about functional estimation in the context of sparse additive models. For example, it would be of interest to estimate the quadratic functional or the norm . It is well known in the nonparametric literature that estimating norms for odd yields drastically different rates from testing and from even [32, 19]. A compelling direction is to investigate the same problem in the sparse additive model setting.
Additionally, it would be interesting to consider the nonparametric regression model where the design distribution exhibits some dependence between the coordinates. The correspondence between the white noise model (1) and the nonparametric regression model we relied on requires that the design distribution be close to uniform on . However, in practical situations it is typically the case that the coordinates of exhibit dependence, and it would be interesting to understand how the fundamental limits of testing are affected.
Finally, in a somewhat different direction than that discussed so far, it is of interest to study the signal detection problem under group sparsity in other models. As we had encouraged, our results can be interpreted exclusively in terms of sequence space, that is the problem (7)-(8) with parameter space (6). From this perspective, the group sparse structure is immediately apparent. Estimation has been extensively studied in group sparse settings, especially in linear regression (see [47] and references therein). Hypothesis testing has not witnessed nearly the same level of research activity, and so there is ample opportunity. We imagine some features of the rates established in this paper are general features of the group sparse structure, and it would be intriguing to discover the commonalities.
6 Acknowledgments
SK is grateful to Oleg Lepski for a kind correspondence. The research of SK is supported in part by NSF Grants DMS-1547396 and ECCS-2216912. The research of CG is supported in part by NSF Grant ECCS-2216912, NSF Career Award DMS-1847590, and an Alfred Sloan fellowship.
7 Proofs
7.1 Minimax upper bounds
7.1.1 Sparse
Proof of Proposition 2.
Fix . We will set at the end of the proof, so for now let . The universal constant will be selected in the course of the proof. Let denote the universal constant from Lemma 21. Set and where and are the universal constants in the exponential terms of Lemmas 15 and 14 respectively.
We first bound the Type I error. Since , we have . Therefore, we can apply Lemma 15 to obtain that
for any . Here, is a universal constant. Taking and noting that provided we select , we see that
where we have used that and where we have selected . Note we have also used that . Thus, with these choices of and , we have
provided we select .
We now examine the Type II error. To bound the Type II error, we will use Chebyshev’s inequality. In particular, consider that for any with , we have
(30)
provided that . To ensure this application of Chebyshev’s inequality is valid, we must compute suitable bounds for the expectation and variance of , which the following lemmas provide.
Lemma 4.
If is larger than some sufficiently large universal constant, then .
Lemma 5.
If is larger than some sufficiently large universal constant, then where is a universal constant.
These bounds are proved later on. Let us now describe why they allow us to bound the Type II error. From (30) as well as Lemmas 4 and 5, we have
provided we pick larger than a sufficiently large universal constant as required by Lemmas 4 and 5. With this bound in hand and since , we can now pick sufficiently large depending only on to obtain
To summarize, we can pick depending only on to be sufficiently large and also satisfying the condition and those of Lemmas 4, 5 to ensure the testing risk is bounded by , i.e.
as desired. ∎
It remains to prove Lemmas 4 and 5. Recall we work in the environment of the proof of Proposition 2.
Proof of Lemma 4.
To prove the lower bound on the expectation, first recall
Under , we have where . Here, the collection denotes the basis coefficients of . Intuitively, there are two reasons why the expectation might be small. First, we are thresholding and so we are intuitively removing those coordinates with small, but nonetheless nonzero, means. Furthermore, we are truncating at level and not considering higher-order basis coefficients; this also incurs a loss in signal.
Let us first focus on the effect from thresholding. Let denote the universal constant from Lemma 13. Applying Lemma 13 yields
where denotes the subset of active variables such that . Note that such exists and since . We have thus bounded the amount of signal lost from thresholding.
Let us now examine the effect of truncation. Consider that
Here, we have used for all . Therefore, we have shown , and thus have quantified the loss due to truncation.
We are now in position to put together the two pieces. Consider . Consequently,
We have used the decreasing order of the kernel’s eigenvalues, i.e. that implies . Further, consider that by definition of and , we have
With this in hand, it follows that . To summarize, we have shown
(31) |
Here, we have used . We can also conclude
In view of the above bound and since , it suffices to pick large enough to satisfy to ensure . The proof of the lemma is complete. ∎
Proof of Lemma 5.
To bound the variance of , recall that , and so . Therefore, we can apply Lemma 14. By Lemma 14, we have
where is a positive universal constant whose value can change from instance to instance. Recall is defined at the beginning of the proof of Proposition 2. Since and , we have
Therefore,
Taking larger than a sufficiently large universal constant, noting , and invoking (31), we have the desired result. ∎
Proof of Proposition 3.
Our proof is largely the same as in the proof of Proposition 2, except we invoke results about the “tail” rather than the “bulk”. Fix . We will make a choice of at the end of the proof, so for now let . We select in the course of the proof, but we will select now. Set with where and are the universal constants in the exponential terms of Lemmas 11 and 10 respectively.
We first bound the Type I error. Since , we have . Since , we have by Lemma 11 that for any ,
where is a universal positive constant. Taking and noting provided we have chosen , we see that
where we have used that and we have set . Thus, with these choices of and we have
provided we select .
We now examine the Type II error. To bound the Type II error, we will use Chebyshev’s inequality. In particular, consider that for any with , we have
(32)
provided that . To ensure this application of Chebyshev’s inequality is valid and to bound the Type II error, we will need a lower bound on the expectation of . We will also need an upper bound on the variance of in order to bound the Type II error. The following lemmas provide us with the requisite bounds; they are analogous to Lemmas 4 and 5 but are now in the context of the tail regime.
Lemma 6.
If is larger than some sufficiently large universal constant, then .
Lemma 7.
If is larger than some sufficiently large universal constant, then where is a universal constant.
Proof of Lemma 6.
The proof is similar in style to the proof of Lemma 4, except now results for the tail regime are invoked. Letting denote the universal constant from Lemma 9, applying Lemma 9, and arguing similarly to the proof of Lemma 4, we obtain
where denotes the subset of active variables such that . Note that such exists and since . Further arguing as in the proof of Lemma 4 and using , we have
(33)
To summarize, we have shown
(34)
Note that we also can conclude
Since , it suffices to pick large enough to satisfy to ensure . The proof of the lemma is complete. ∎
Proof of Lemma 7.
Recall that . By definition of we have . In other words, and so we can apply Lemma 10, which yields
where is a positive universal constant. Recall we had defined at the beginning of the proof of Proposition 3. Since , we have . Therefore,
where we have used that by definition of . Therefore,
Taking larger than a sufficiently large universal constant and invoking (7.1.1) yields the desired result. ∎
7.1.2 Dense
Proof of Proposition 4.
For ease of notation, set . Define . Let . We first bound the Type I error. Consider that under we have . Observe that and . Consequently, we have by Chebyshev’s inequality
We now examine the Type II error. Under , we have where denote the basis coefficients associated to . Note and . In order to proceed with the argument, we need to obtain a lower bound estimate for the signal strength. Letting denote the set of for which are nonzero, consider that for we have
We have used for all in the final line. Therefore by definition of we have
where we have used that .
We now continue with bounding the Type II error. By Chebyshev’s inequality, we have
Therefore, the sum of Type I and Type II errors is bounded by as desired. ∎
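In spirit, the dense test is a global chi-square statistic calibrated by Chebyshev's inequality. The sketch below illustrates that calibration in a sequence-model caricature; the dimensions and the level are arbitrary, and the statistic is not claimed to be the exact one analyzed in Proposition 4.

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 500, 32                       # hypothetical number of covariates and basis size
y = rng.standard_normal((p, d))      # data under the null: pure noise, unit variance

T = np.sum(y ** 2) - p * d           # global (dense) statistic: centered sum of chi-squares

# Under the null, Var(T) = 2 p d, so Chebyshev's inequality calibrates a level-eta threshold
eta = 0.05
threshold = np.sqrt(2 * p * d / eta)
print("reject:", bool(T > threshold))   # False with probability at least 1 - eta under the null
```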
7.2 Minimax lower bounds
Proof of Proposition 1.
We break up the analysis into two cases.
Case 1: Suppose . We will construct a prior distribution on and use Le Cam’s two point method to furnish a lower bound. Define . Let . Let be the prior in which a draw is constructed by uniformly drawing of size and setting
where is given by . Note that implies and . By the Neyman-Pearson lemma and the inequality , we have
(35)
where denotes the mixture induced by . By the Ingster-Suslina method (Proposition 5) and Corollary 3, we have
where and are the corresponding random sets. Now, using that , we have
We have also used that and the inequality for and . Plugging into (35) yields the desired result.
Case 2: Suppose . We can repeat the analysis done in Case 1 with the modification of replacing every instance of with . ∎
Proof of Theorem 1.
We will construct a prior distribution on and use Le Cam’s two point method to furnish a lower bound. Set . Let . We break up the analysis into two cases.
Case 1: Suppose we are in the regime where . Set . Let be the prior in which a draw is obtained by uniformly drawing of size and drawing
where denotes the probability measure placing full mass at zero. Note that implies . Here, we have used that since . Furthermore, consider that for , we have
Here, we have used the ordering of the eigenvalues and by Lemma 25. Of course, for we have . Hence we have . We then have by the Neyman-Pearson lemma and the inequality that
(36)
For the following calculations, let and be the corresponding random sets. Also, let denote an iid collection of random variables. By the Ingster-Suslina method (Proposition 5), independence of with , and Corollary 3, we have
Consider by Lemma 1 that . Therefore,
We have used and the inequality for and . Using with (36) yields the desired result.
7.3 Adaptive lower bound
Proof of Theorem 2.
Recall the definitions of in (15) and in (20). By assumption, we have for a universal constant . For ease of notation, let us denote where is given by (19). Set and let . We now define a prior . A draw is obtained as follows. First, draw . Set . Then draw uniformly at random a subset of size exactly . Let satisfy . Draw independently
Here, . This description concludes the definition of .
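Schematically, a draw from a prior of this form can be simulated as follows; the grid of sparsity levels, the signal magnitude, and the placement of the signed coefficients are hypothetical placeholders for the quantities defined above.

```python
import numpy as np

rng = np.random.default_rng(2)
p, d = 200, 64                                   # hypothetical dimensions
sparsity_grid = 2 ** np.arange(7)                # hypothetical dyadic grid of sparsity levels
rho = 0.5                                        # hypothetical per-coefficient magnitude

s = int(rng.choice(sparsity_grid))               # random sparsity level from the grid
S = rng.choice(p, size=s, replace=False)         # support drawn uniformly among size-s subsets

theta = np.zeros((p, d))
theta[S, 0] = rho * rng.choice([-1.0, 1.0], size=s)   # independent random signs on active coordinates
print("sparsity:", s, "squared signal norm:", float(np.sum(theta ** 2)))
```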
Now we must show that is indeed supported on . Note that when we draw , the associated sparsity level always satisfies . Furthermore, conditional on we have
Furthermore, consider that for we have
where the last line follows from . Note we have used the ordering of the eigenvalues as well as by Lemma 26. Of course, for we have . Hence, and so is properly supported.
Writing for the mixture, we have
(37)
For the following calculations, let . Let be the corresponding quantities. Let and denote the corresponding sparsities. Denote the corresponding support sets and . Also, let denote an iid collection of random variables which is independent of and . For ease of notation, let . Likewise, let . By the Ingster-Suslina method (Proposition 5), we have
Here, we have used the inequality for . Consider that by Lemma 26
We have used along with the inequality for . Therefore,
(38)
Since , we can use Lemma 28 and the inequality for and to obtain
Recall we write and . Moreover, recall . Now observe
Recall we have . Define the sets
Then
Let us examine the second term. Note that for any we have for all . Using this, we obtain
The final line results from . Note we have also used to obtain the penultimate line. We now examine . Consider that . Therefore,
Note we have used . We have also used along with the inequality for all to obtain the penultimate line. Therefore, we have shown
Plugging into (37) yields
The proof is complete. ∎
7.4 Adaptive upper bound
Proof of Theorem 3.
Fix . In various places of the proof, we will point out that can be taken sufficiently large to obtain desired bounds, so for now let . Let denote the universal constant from Lemma 21. Let denote the maximum of the corresponding universal constants from Lemmas 15 and 11. Set
Here, with and being the universal constants in the exponential terms of Lemmas 15 and 14 respectively. Likewise, where and are the universal constants in the exponential terms of Lemmas 11 and 10. Note that these choices of are almost identical to the choices in the proofs of Propositions 2 and 3. The only modifications are the terms and , and the utility of this modification will become clear through the course of the proof.
For any , define
We examine the Type I and Type II errors separately. Focusing on the Type I error, union bound yields
(39)
We bound each term separately.
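The adaptive procedure scans a grid of candidate sparsity levels and rejects if any of the fixed-sparsity tests rejects; the union bound above then charges each grid point its own level. A schematic version, with hypothetical statistics and thresholds standing in for those defined in the proof:

```python
import numpy as np

def adaptive_test(stats: dict, thresholds: dict) -> bool:
    """Reject if any test in the grid rejects (a Bonferroni-style scan over sparsity levels)."""
    return any(stats[s] > thresholds[s] for s in stats)

# Toy usage: by the union bound, the overall Type I error is at most the sum of the
# per-grid-point levels, which motivates inflating each threshold by a log(grid size) term.
grid = [1, 2, 4, 8, 16]
stats = {s: float(np.random.default_rng(s).standard_normal()) for s in grid}
thresholds = {s: 3.0 + np.sqrt(2 * np.log(len(grid))) for s in grid}
print("reject:", adaptive_test(stats, thresholds))
```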
Type I error: Bulk
For ,
Consider that Lemma 15 gives us the following. If we select satisfying
(40)
then we have . Let us select
To verify (40), consider that by our choice of , and so
Here, we have used that . We have also used , which holds provided we select large enough (i.e. ).
Likewise, consider
Therefore, (40) is satisfied, and so we have
where is the universal constant . With this bound in hand, observe
Here we have used from Lemma 3. We bound each term separately. First, consider
provided we take sufficiently large.
Likewise, consider
again, provided we take sufficiently large. We have used that exhibits at most logarithmic growth in , i.e. we use the crude bound . Therefore, we have established
(41)
Type I error: Tail
For , we have
Consider that Lemma 11 gives us an analogue to (40), namely if we select satisfying
(42)
then we have . Let us select
To verify (42), consider that by our choice of , and so
Here, we have used . Likewise, consider that
Therefore, (42) is satisfied, and so we have
where is the universal constant . With this bound in hand, observe
Here we have used from Lemma 3. From here, the same argument from the bulk case can be employed to conclude
(43)
provided is taken sufficiently large.
Type I error: Dense
Let us now bound . Consider that for any , we have
To obtain the fourth line, we have used which implies . In the above display, we have also used (which holds provided we take large enough) to obtain the penultimate line and Lemma 16 to obtain the final line. Consequently,
(44)
since by Lemma 3 and , which holds provided we take sufficiently large.
Type II error: We now examine the Type II error. Suppose . Let with . We proceed by considering various cases.
Type II error: Dense
Suppose . Let denote the smallest element in which is greater than or equal to . Then
Consider that
where the collection denotes the basis coefficients for . We will also use the matrix to denote this collection of basis coefficients; note that . Let denote the set of for which are nonzero. Observe
Therefore,
We have used that to obtain the third line. Therefore, we have
The signal magnitude satisfies the requisite strength needed to successfully detect, which can be seen by following the argument of Proposition 4. Hence, we can achieve Type II error less than or equal to by selecting sufficiently large. We can combine this bound with (7.4) to conclude that the total testing risk is bounded above by , as desired.
Type II error: Bulk
Suppose and . Let be the smallest element in greater than or equal to . Let be the smallest element in greater than or equal to . By the definitions of these grids, we have and . Then
With these choices, we have
Note that since , we have . Following arguments similar to those in the proof of Proposition 2, it can be seen that the necessary signal strength to successfully detect is of squared order
This is precisely the order of , and so we can obtain Type II error less than by choosing sufficiently large. We can combine this bound with (7.4) to conclude that the total testing risk is bounded above by , as desired.
Type II error: Tail
Suppose and . Let and be as defined in the previous case, i.e. Type II error analysis for the bulk case. As before, we have
Now, consider that and so .
Suppose we have
We can follow arguments similar to those in the proof of Proposition 3 to see that the necessary signal strength to successfully detect is of squared order
This is precisely the order of and so we can obtain Type II error less than by choosing sufficiently large.
On the other hand, suppose we have
As argued in the previous section about the bulk regime, the necessary signal strength to successfully detect is of squared order
Consider that
and so we can obtain Type II error less than by choosing sufficiently large. We can combine this bound with (7.4) to conclude that the total testing risk is bounded above by , as desired. ∎
7.5 Smoothness and sparsity adaptive lower bounds
Proof of Theorem 5.
Fix . Fix any . Through the course of the proof, we will note where we must take suitably small, so for now let . For each , define the geometric grid
Note .
We now define a prior which is supported on the alternative hypothesis. Let . Let denote the solution to
Note that for any . Note is random since is random. Draw uniformly at random a subset of size . Draw independently
Here, is given by . Having defined , the definition of the prior is complete.
Now we must show is indeed supported on the alternative hypothesis. Consider that for , we have
Furthermore, consider that for we have by definition of ,
We can take to ensure . Of course, for we have . Hence, we have shown with probability one, and so has the proper support.
Writing to denote the mixture induced by the prior , we have
For the following calculations, let . Let be the corresponding quantities and let denote the corresponding smoothness levels. Further let denote the corresponding support sets. Note both are of size . Let denote an iid collection of random variables which is independent of all the other random variables. By the Ingster-Suslina method (Proposition 5), we have
From the definition of , we have
Here, we have used since and . We have also used the inequality for . With this in hand, it follows that
Noting that we can write , and that , we can follow the same steps as in the proof of Proposition 4.2 in [16]. Taking suitably small, we obtain which yields
as desired. ∎
Proof of Theorem 6.
Some simplification is convenient. Note can be rewritten as
Note that in the case , we have . Therefore, and so we can further simplify
(46)
Writing in the form (46) is convenient. The lower bound is exactly the minimax lower bound and so no new argument is needed. All that needs to be proved is the lower bound .
The proof is very similar to that of Theorem 5, so we only point out the modifications in the interest of brevity. Fix any . Let be the prior from the proof of Theorem 5, but use given by (47) and given by (48), defined below, instead. Define the geometric grid
(47)
Let denote the solution to
(48)
Note for any . Also, use
It can be checked in the same manner that with these modifications is properly supported on the alternative hypothesis. We can continue along as in the proof of Theorem 5 up until the point we have
From here, we use the fact that implies . In other words, there exists a constant such that
Then by taking sufficiently small, the rest of the proof of Theorem 5 can be carried out to obtain the desired result. ∎
7.6 Smoothness and sparsity adaptive upper bounds
Proof of Theorem 4.
Fix . For ease, let us just write . We will note throughout the proof where we take suitably large, so for now let . We will also note where we take suitably large. We will also note where we take suitably large. We first examine the Type I error. By union bound and taking ,
for some universal positive constant . Here, we have used . Taking suitably large, the Type I error is bounded as desired
We now examine the Type II error. Fix any , , and with . Set
Let be the smallest element larger than . Note by definition of .
Consider that since and , it immediately follows that grows polynomially in , and so . Therefore, we have for some universal positive constant . Then by Chebyshev’s inequality
Consider that
Therefore, taking sufficiently large, we have
Hence, the Type II error is bounded suitably. Since was arbitrary, we have shown that
as desired.
∎
References
- [1] Ery Arias-Castro, Emmanuel J. Candès and Yaniv Plan “Global testing under sparse alternatives: ANOVA, multiple comparisons and the higher criticism” In Ann. Statist. 39.5, 2011, pp. 2533–2556 DOI: 10.1214/11-AOS910
- [2] Yannick Baraud “Non-asymptotic minimax rates of testing in signal detection” In Bernoulli 8.5, 2002, pp. 577–606
- [3] Sohom Bhattacharya, Jianqing Fan and Debarghya Mukherjee “Deep neural networks for nonparametric interaction models with diverging dimension” In Ann. Statist., Forthcoming
- [4] Lucien Birgé and Pascal Massart “Gaussian model selection” In J. Eur. Math. Soc. (JEMS) 3.3, 2001, pp. 203–268 DOI: 10.1007/s100970100031
- [5] Lawrence D. Brown and Mark G. Low “Asymptotic equivalence of nonparametric regression and white noise” In Ann. Statist. 24.6, 1996, pp. 2384–2398 DOI: 10.1214/aos/1032181159
- [6] M. V. Burnasev “Minimax detection of an imperfectly known signal against a background of Gaussian white noise” In Teor. Veroyatnost. i Primenen. 24.1, 1979, pp. 106–118
- [7] Alexandra Carpentier and Nicolas Verzelen “Adaptive estimation of the sparsity in the Gaussian vector model” In Ann. Statist. 47.1, 2019, pp. 93–126 DOI: 10.1214/17-AOS1680
- [8] Julien Chhor, Rajarshi Mukherjee and Subhabrata Sen “Sparse signal detection in heteroscedastic Gaussian sequence models: sharp minimax rates” In Bernoulli 30.3, 2024, pp. 2127–2153 DOI: 10.3150/23-bej1667
- [9] O. Collier, L. Comminges and A. B. Tsybakov “On estimation of nonsmooth functionals of sparse normal means” In Bernoulli 26.3, 2020, pp. 1989–2020 DOI: 10.3150/19-BEJ1180
- [10] Olivier Collier, Laëtitia Comminges and Alexandre B. Tsybakov “Minimax estimation of linear and quadratic functionals on sparsity classes” In Ann. Statist. 45.3, 2017, pp. 923–958 DOI: 10.1214/15-AOS1432
- [11] Olivier Collier, Laëtitia Comminges, Alexandre B. Tsybakov and Nicolas Verzelen “Optimal adaptive estimation of linear functionals under sparsity” In Ann. Statist. 46.6A, 2018, pp. 3130–3150 DOI: 10.1214/17-AOS1653
- [12] L. Comminges, O. Collier, M. Ndaoud and A. B. Tsybakov “Adaptive robust estimation in sparse vector model” In Ann. Statist. 49.3, 2021, pp. 1347–1377 DOI: 10.1214/20-aos2002
- [13] Nabarun Deb, Rajarshi Mukherjee, Sumit Mukherjee and Ming Yuan “Detecting structured signals in Ising models” In Ann. Appl. Probab. 34.1A, 2024, pp. 1–45 DOI: 10.1214/23-aap1929
- [14] David Donoho and Jiashun Jin “Higher criticism for detecting sparse heterogeneous mixtures” In Ann. Statist. 32.3, 2004, pp. 962–994 DOI: 10.1214/009053604000000265
- [15] M. S. Ermakov “Minimax detection of a signal in Gaussian white noise” In Teor. Veroyatnost. i Primenen. 35.4, 1990, pp. 704–715 DOI: 10.1137/1135098
- [16] Chao Gao, Fang Han and Cun-Hui Zhang “On estimation of isotonic piecewise constant signals” In Ann. Statist. 48.2, 2020, pp. 629–654 DOI: 10.1214/18-AOS1792
- [17] Ghislaine Gayraud and Yuri Ingster “Detection of sparse additive functions” In Electron. J. Stat. 6, 2012, pp. 1409–1448 DOI: 10.1214/12-EJS715
- [18] Peter Hall and Jiashun Jin “Innovated higher criticism for detecting sparse signals in correlated noise” In Ann. Statist. 38.3, 2010, pp. 1686–1732 DOI: 10.1214/09-AOS764
- [19] Yanjun Han, Jiantao Jiao and Rajarshi Mukherjee “On estimation of -norms in Gaussian white noise models” In Probab. Theory Related Fields 177.3-4, 2020, pp. 1243–1294 DOI: 10.1007/s00440-020-00982-x
- [20] Trevor Hastie and Robert Tibshirani “Generalized additive models” In Statist. Sci. 1.3, 1986, pp. 297–318
- [21] Yu. I. Ingster “Adaptive tests for minimax testing of nonparametric hypotheses” In Probability theory and mathematical statistics, Vol. I (Vilnius, 1989) “Mokslas”, Vilnius, 1990, pp. 539–549
- [22] Yu. I. Ingster “Asymptotically minimax testing of nonparametric hypotheses” In Probability theory and mathematical statistics, Vol. I (Vilnius, 1985) VNU Sci. Press, Utrecht, 1987, pp. 553–574
- [23] Yu. I. Ingster “Minimax nonparametric detection of signals in white Gaussian noise” In Problemy Peredachi Informatsii 18.2, 1982, pp. 61–73
- [24] Yu. I. Ingster and I. A. Suslina “Nonparametric Goodness-of-Fit Testing Under Gaussian Models” 169, Lecture Notes in Statistics Springer-Verlag, New York, 2003 DOI: 10.1007/978-0-387-21580-8
- [25] Yuri Ingster and Oleg Lepski “Multichannel nonparametric signal detection” In Math. Methods Statist. 12.3, 2003, pp. 247–275
- [26] Yuri I. Ingster, Alexandre B. Tsybakov and Nicolas Verzelen “Detection boundary in sparse regression” In Electron. J. Stat. 4, 2010, pp. 1476–1526 DOI: 10.1214/10-EJS589
- [27] Iain M. Johnstone “Gaussian estimation: Sequence and wavelet models”
- [28] Iain M. Johnstone “Chi-square oracle inequalities” In State of the art in probability and statistics (Leiden, 1999) 36, IMS Lecture Notes Monogr. Ser. Inst. Math. Statist., Beachwood, OH, 2001, pp. 399–418 DOI: 10.1214/lnms/1215090080
- [29] Vladimir Koltchinskii and Ming Yuan “Sparsity in multiple kernel learning” In Ann. Statist. 38.6, 2010, pp. 3660–3695 DOI: 10.1214/10-AOS825
- [30] Subhodh Kotekal and Chao Gao “Minimax rates for sparse signal detection under correlation” In Inf. Inference 12.4, 2023, Paper No. iaad044, 97 pp. DOI: 10.1093/imaiai/iaad044
- [31] B. Laurent and P. Massart “Adaptive estimation of a quadratic functional by model selection” In Ann. Statist. 28.5, 2000, pp. 1302–1338 DOI: 10.1214/aos/1015957395
- [32] O. Lepski, A. Nemirovski and V. Spokoiny “On estimation of the norm of a regression function” In Probab. Theory Related Fields 113.2, 1999, pp. 221–253 DOI: 10.1007/s004409970006
- [33] Yi Lin and Hao Helen Zhang “Component selection and smoothing in multivariate nonparametric regression” In Ann. Statist. 34.5, 2006, pp. 2272–2297 DOI: 10.1214/009053606000000722
- [34] Haoyang Liu, Chao Gao and Richard J. Samworth “Minimax rates in sparse, high-dimensional change point detection” In Ann. Statist. 49.2, 2021, pp. 1081–1112 DOI: 10.1214/20-aos1994
- [35] Lukas Meier, Sara van de Geer and Peter Bühlmann “High-dimensional additive modeling” In Ann. Statist. 37.6B, 2009, pp. 3779–3821 DOI: 10.1214/09-AOS692
- [36] Rajarshi Mukherjee and Gourab Ray “On testing for parameters in Ising models” In Ann. Inst. Henri Poincaré Probab. Stat. 58.1, 2022, pp. 164–187 DOI: 10.1214/21-aihp1157
- [37] Rajarshi Mukherjee and Subhabrata Sen “On minimax exponents of sparse testing” arXiv:2003.00570 [math, stat] arXiv, 2020 DOI: 10.48550/arXiv.2003.00570
- [38] Garvesh Raskutti, Martin J. Wainwright and Bin Yu “Minimax-optimal rates for sparse additive models over kernel classes via convex programming” In J. Mach. Learn. Res. 13, 2012, pp. 389–427
- [39] Pradeep Ravikumar, John Lafferty, Han Liu and Larry Wasserman “Sparse additive models” In J. R. Stat. Soc. Ser. B Stat. Methodol. 71.5, 2009, pp. 1009–1030 DOI: 10.1111/j.1467-9868.2009.00718.x
- [40] Markus Reiß “Asymptotic equivalence for nonparametric regression with multivariate and random design” In Ann. Statist. 36.4, 2008, pp. 1957–1982 DOI: 10.1214/07-AOS525
- [41] V.. Spokoiny “Adaptive hypothesis testing using wavelets” In Ann. Statist. 24.6, 1996, pp. 2477–2498 DOI: 10.1214/aos/1032181163
- [42] N. M. Temme “The asymptotic expansion of the incomplete gamma functions” In SIAM J. Math. Anal. 10.4, 1979, pp. 757–766 DOI: 10.1137/0510071
- [43] N. M. Temme “Uniform asymptotic expansions of the incomplete gamma functions and the incomplete beta function” In Math. Comp. 29.132, 1975, pp. 1109–1114 DOI: 10.2307/2005750
- [44] A. B. Tsybakov “Pointwise and sup-norm sharp adaptive estimation of functions on the Sobolev classes” In Ann. Statist. 26.6, 1998, pp. 2420–2469 DOI: 10.1214/aos/1024691478
- [45] Alexandre B. Tsybakov “Introduction to Nonparametric Estimation”, Springer Series in Statistics Springer, New York, 2009 DOI: 10.1007/b13794
- [46] Roman Vershynin “High-Dimensional Probability” 47, Cambridge Series in Statistical and Probabilistic Mathematics Cambridge University Press, Cambridge, 2018 DOI: 10.1017/9781108231596
- [47] Martin J. Wainwright “High-Dimensional Statistics” 48, Cambridge Series in Statistical and Probabilistic Mathematics Cambridge University Press, Cambridge, 2019 DOI: 10.1017/9781108627771
- [48] Yuting Wei and Martin J. Wainwright “The local geometry of testing in ellipses: tight control via localized Kolmogorov widths” In IEEE Trans. Inform. Theory 66.8, 2020, pp. 5110–5129 DOI: 10.1109/TIT.2020.2981313
- [49] Yun Yang and Surya T. Tokdar “Minimax-optimal nonparametric regression in high dimensions” In Ann. Statist. 43.2, 2015, pp. 652–674 DOI: 10.1214/14-AOS1289
- [50] Ming Yuan and Ding-Xuan Zhou “Minimax optimal rates of estimation in high dimensional additive models” In Ann. Statist. 44.6, 2016, pp. 2564–2593 DOI: 10.1214/15-AOS1422
- [51] Anru R. Zhang and Yuchen Zhou “On the non-asymptotic and sharp lower tail bounds of random variables” In Stat 9, 2020, Paper No. e314, 11 pp. DOI: 10.1002/sta4.314
Appendix A Hard thresholding
In this section, we collect results about the random variable for and defined in (13). We consider the tail and the bulk separately.
The proof outlines are similar to those employed in [34]. However, they only consider , whereas we need to consider the general case. Consequently, a much more careful analysis is required.
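As a sanity check on the style of result collected in this appendix, the mean of a centered, hard-thresholded (noncentral) chi-square variable can be estimated by Monte Carlo; the form of the variable below is a plausible stand-in for the quantity defined in (13), with hypothetical degrees of freedom, noncentrality, and threshold.

```python
import numpy as np

rng = np.random.default_rng(3)
d, ncp, n_mc = 20, 5.0, 100_000                 # degrees of freedom, noncentrality, Monte Carlo size

# Noncentral chi-square draws: squared norm of N(mu, I_d) with ||mu||^2 = ncp
mu = np.zeros(d)
mu[0] = np.sqrt(ncp)
x = np.sum((rng.standard_normal((n_mc, d)) + mu) ** 2, axis=1)

tau = d + 2.5 * np.sqrt(2 * d)                  # a hypothetical hard threshold
w = (x - d) * (x > tau)                         # centered, hard-thresholded variable
print("estimated mean:", w.mean(), "+/-", 2 * w.std() / np.sqrt(n_mc))
```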
A.1 Tail
Lemma 8 (Moment generating function).
Suppose is a positive universal constant. There exist universal positive constants and such that if and , then we have
where , , and where is given by (13).
Proof.
We follow the broad approach of the argument presented in the proof of Lemma 18 in [34]. The universal positive constant will be chosen later on in the proof, so for now let and . Note we will also take large enough so that Lemma 21 is applicable. Since , we have
Consider that for any we have
Therefore,
(49)
Each term will be bounded separately. Considering the first term, note that Lemma 22 asserts for a universal constant . Note also that . Therefore,
where is a universal positive constant and remains a universal constant but whose value may change from line to line. Note we have also used Corollary 1 in the above display as well as . To summarize, we have shown
(50)
We now bound the second term in (49). Let denote the probability density function of the distribution. We have
An application of Lemma 18 yields
where remains a universal constant but has a value which may change from line to line. Note we have used , Corollary 1, and . Likewise, consider that
where we have used , , Lemma 22, and Corollary 1. Again, and remain universal constants but have values which may change from line to line. Thus, we have shown
(51)
We now bound the final term in (49). Consider that for we have
Let . Then and so
where . To summarize, we have shown
(52)
To continue, we would like to apply Corollary 2, but we must first verify the conditions. Let and with . Note we can take to be a sufficiently large universal constant in order to ensure is sufficiently large. Since , it follows from Lemma 22 that for some positive universal constant . Without loss of generality, we can take . Let us restrict our attention to . Observe that
where we have defined . Since we have shown , we can apply Corollary 2. By Corollary 2 we have
Here, denotes the cumulative distribution function of the standard normal distribution and is a universal constant. By the Gaussian tail bound for where , we have
where the value of has changed from line to line but remains a universal constant. To continue the calculation, we need to bound from below. Note we can take . For , it follows from and that
Consequently, we have the bound
where the value of has changed but remains a universal constant. We now examine the term . Consider that
Therefore, letting the value of change from line to line but remaining a universal constant, we have from (52)
for a universal positive constant . We have used that for . Note that we can use this since . Since for all , it follows that
(53)
where the value of , again, has changed but remains a universal constant. Putting together our bounds (50), (51), (53) into (49) yields
for where are universal constants. The proof is complete. ∎
Lemma 9.
Let . Suppose is a universal positive constant. There exists a universal constant such that for every , we have
Here, is given by (13).
Proof.
We will make a choice for at the end of the proof. Consider first the case where . Then by definition of and so we have the desired result. The second case follows since the expression inside the expectation is stochastically increasing in . Moving on to the final case, suppose . Since , we have
Consequently,
By Lemma 22 and , there exists a universal positive constant such that . Therefore,
Selecting yields the desired result. ∎
Lemma 10.
Let . If for a universal positive constant , then
Here, is given by (13) and is a universal positive constant.
Proof.
Lemma 11.
Suppose is a universal positive constant. There exist universal positive constants and such that if , , and , then
where and is given by (13). Here, is a universal positive constant.
Proof.
Let be the constant from Lemma 8. We use the Chernoff method to obtain the desired bound. Let be as in Lemma 8 and let be the universal constant from Lemma 8. For any , we have by Lemma 8
where is a universal constant and is a universal constant whose value may change from line to line. Selecting and then choosing suitably yields the desired result. The proof is complete. ∎
A.2 Bulk
Lemma 12.
Proof.
The argument we give will be similar to the proof of Lemma 8, namely we will separately bound each term in (49) and substitute into the equation
We start with the first term on the right-hand side in (49). By Lemma 18, we have for a universal positive constant . Note that since we trivially have . Consequently, arguing as in the proof of Lemma 8, we have
where are universal constants. We have used Corollary 1 to obtain the final inequality. We now bound the second term in (49). Letting denote the probability density function of the distribution, we can repeat and then continue the calculation in the proof of Lemma 8 to obtain
where the values of have changed but remain universal constants. We have used Lemma 24 here.
We now bound the final term in (49). Arguing exactly as in the proof of Lemma 8, we have
for . Recall that .
Let us restrict our attention to . We seek to apply Corollary 2, but we must verify the conditions. Let and with . Observe
Consider that . Therefore, and so
Note we have used . Since and which is a sufficiently large universal constant, we can apply Corollary 2. By Corollary 2, we have
Here, is a positive universal constant. Recall that denotes the cumulative distribution function of the standard normal distribution. By the Gaussian tail bound for , we have
(54)
where the value of has changed but remains a universal constant. To continue with the bound, we need to bound from below. Let us take larger than . Since , we have
where is a universal constant. We can conclude from (54) that
where the value of has changed but remains a universal constant. We now examine the term . Arguing exactly as in the proof of Lemma 8, we obtain
which, as argued in the proof of Lemma 8, leads us to the bound
for as desired. ∎
Lemma 13.
Proof.
Lemma 14.
Proof.
Let and respectively denote the probability density function and cumulative distribution function of the distribution.
Case 1: Consider the first case in which . Then
where we have applied the definition of and we have applied Lemma 24. Here, is a universal positive constant. An application of Corollary 1 gives the desired result for this case.
We now move to the other two cases. Suppose . For ease of notation, let . Observe
(55)
Note that since (by an appeal to stochastic ordering), we can find upper bounds for the square of this conditional expectation by first finding upper bounds on the conditional expectation. We examine each term separately in the above display. First, consider
(56)
We now examine the second term. Note we can write where . Therefore,
(57)
where we have used that . With this in hand, we have , which further implies . Note we have also used to obtain the second line in the above display. With the above display in hand, we now examine the remaining two cases.
Case 2: Consider the case . Observe that
Examining the denominator, since where the two -variates on the right hand side are independent, we have
where is a universal positive constant. Examining the numerator, consider that by Lemma 18 we have
Hence, from (57), we have shown
Thus, we have the bound
Consider that
where the value of can change in each expression but remains a universal positive constant. The final inequality follows from Stirling’s formula. Moreover, since , there exists a positive universal constant such that . Furthermore, it follows by Chebyshev’s inequality that
Consequently,
The proof for this case is complete.
Lemma 15.
Appendix B Properties of the distribution
Theorem 7 (Bernstein’s inequality - Theorem 2.8.1 [46]).
Let be independent mean-zero subexponential random variables. Then, for every , we have
where is a universal constant and denote the subgaussian and subexponential norms respectively (see (2.13) and (2.21) of [46]).
Corollary 1.
Suppose . If , then
where is a universal constant.
Lemma 16 (Lemma 1 [31]).
Let . If and , then
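In the central chi-square special case (equal weights), the bound of [31] reads P(χ²_d ≥ d + 2√(dx) + 2x) ≤ e^{−x}; reducing to this case is our simplification here, since the general weighted display is not reproduced above. A quick numerical check:

```python
import numpy as np
from scipy.stats import chi2

# Central chi-square case of Lemma 1 of [31]: P(chi2_d >= d + 2*sqrt(d*x) + 2*x) <= exp(-x)
d = 10
for x in (1.0, 3.0, 5.0):
    tail = chi2.sf(d + 2 * np.sqrt(d * x) + 2 * x, df=d)
    print(f"x={x}: tail={tail:.3e}  bound={np.exp(-x):.3e}  holds={tail <= np.exp(-x)}")
```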
Lemma 17 (Corollary 3 [51]).
Suppose . There exist universal constants such that
for all .
Lemma 18 ([28]).
Let and respectively denote the probability density and cumulative distribution functions of the distribution. Then the following relations hold:
- (i) ,
- (ii) ,
- (iii) ,
- (iv) .
Lemma 19.
Let and respectively denote the probability density and cumulative distribution functions of the distribution. Suppose . If , then .
Proof.
By the Mean Value Theorem, we have for any ,
Consider that . So taking yields
We now evaluate the infimum on the left-hand side. For , consider that an application of Lemma 18 gives
Since , it follows that
Thus we can immediately conclude as desired. ∎
Lemma 20.
Let denote the cumulative distribution function of the distribution. If , then
where is the upper incomplete gamma function defined in Theorem 8.
Proof.
The result follows directly from a change of variables when integrating the probability density function. ∎
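In terms of the regularized upper incomplete gamma function Q(a, x) = Γ(a, x)/Γ(a), the change of variables gives P(χ²_d > x) = Q(d/2, x/2), which is easy to confirm numerically (the values of d and x below are arbitrary):

```python
from scipy.stats import chi2
from scipy.special import gammaincc   # regularized upper incomplete gamma Q(a, x)

d, x = 7, 11.3
print(chi2.sf(x, df=d))               # P(chi2_d > x)
print(gammaincc(d / 2, x / 2))        # Q(d/2, x/2): agrees with the line above
```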
Lemma 21.
Suppose is larger than a sufficiently large universal constant. There exist universal positive constants and such that the following holds. If and , then
where is given by (13).
Proof.
For ease of notation, let . Let and denote the probability density and cumulative distribution functions of the distribution. We will select the universal constant later on in the proof. By Lemma 18, we have
Rearranging terms and invoking Stirling’s approximation (which states as ) yields
for a universal constant since is larger than a sufficiently large universal constant. Applying Lemma 20 and Corollary 2 yields
where is a universal constant. Observe that is larger than a sufficiently large universal constant since is larger than a sufficiently large universal constant. Using the fact that as , we have
for a universal positive constant . Consider that
Consequently,
Therefore, we have the bound
Consider further that the inequality gives us
Hence, we have
Let us take . With this choice, we have and so . Therefore,
for a universal positive constant . Thus we have
as desired. ∎
Theorem 8 (Uniform expansion of the incomplete gamma function [43]).
For and real numbers, define the upper incomplete gamma function
Further define , and . Then admits an asymptotic series expansion in which is uniform in . In other words, for any integer , we have
where the remainder term satisfies
Here, denotes the cumulative distribution function of the standard normal distribution.
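The leading term of the expansion approximates Q(a, λa) by Φ(−η√a), where η is determined by λ as above. The comparison can be made directly with scipy; the values of a and λ below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import gammaincc
from scipy.stats import norm

a, lam = 50.0, 1.4                                         # arbitrary illustrative values
eta = np.sign(lam - 1) * np.sqrt(2 * (lam - 1 - np.log(lam)))

print("Q(a, lam * a) =", gammaincc(a, lam * a))            # regularized upper incomplete gamma
print("leading term  =", norm.cdf(-eta * np.sqrt(a)))      # Phi(-eta * sqrt(a)), close for large a
```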
Corollary 2.
Consider the setting of Theorem 8. For any and , we have
where
Consequently, if is larger than some universal positive constant and , we have
where is some universal positive constant.
Proof.
The first two displays follow exactly from Theorems 8 and 9. To show the final display, we must show that whenever . First, consider the Taylor expansion
where is some point between and . Therefore,
Thus
since is between and . Therefore, the final display in the statement of the Corollary follows by taking to be larger than some universal constant and taking to be a large enough universal constant. ∎
Lemma 22.
Let . If , then
where is a positive universal constant.
Proof.
We will choose a universal constant at the end of the proof, so for now we leave it as undetermined. Let denote the probability density function of the distribution. Observe that
By Corollary 1, there exists a universal constant such that
Here we have used . Further consider that
Therefore, by Lemma 17 we have
where and are universal positive constants. Taking completes the proof. ∎
Lemma 23.
Let . If , then
where is a positive universal constant.
Proof.
We will choose a universal constant at the end of the proof, so for now we leave it as undetermined. Let denote the probability density function of the distribution. Consider
By Corollary 1, there exists a universal positive constant such that
Here we have used . Further consider
Hence, by Lemma 17 we have
where is a universal constant. Taking completes the proof. ∎
Lemma 24.
Let . If , then
for some universal positive constant .
Proof.
We will use a universal constant in our proof; a choice for it will be made at the end. Let denote the probability density of the distribution. Observe that
By Corollary 1, there exists a universal positive constant such that
Here, we have used . With this term under control, now observe
Hence, by Lemma 17 we have
Taking , we have
which is precisely the desired result. ∎
Appendix C Miscellaneous
C.1 Minimax
Lemma 25.
We have
Consequently, .
Proof.
Consider that
Here, we have used the ordering of the eigenvalues and the fact that to obtain the second term. The bound follows immediately from and by definition of . ∎
C.2 Adaptation
Lemma 26.
Suppose . We have
Consequently, .
Proof.
Consider that
Here, we have used the ordering of the eigenvalues and the fact that to obtain the second term. The bound follows immediately from and by definition of . ∎
Proof of Lemma 2.
We first prove the second inequality. Since and , it immediately follows that . Therefore,
as desired.
Now we prove the first inequality. For any , we have by the ordering of the eigenvalues
For any , by definition of it follows that
Since for all we have shown
taking max over yields
as desired. ∎
Proof of Lemma 3.
For ease of notation, let us write . By the assumption and we have for all . By definition of , we have
and
Define
It is clear for each since the inequality in the penultimate display is strict. Moreover, observe that for all by definition of . Further, since is decreasing in its second argument, it follows that for all . Letting , it immediately follows that for all . By definition of , it follows that for any
where we have used and to obtain the final inequality. The proof is complete. ∎
C.3 Technical
Proposition 5 (Ingster-Suslina method [24]).
Suppose is a positive definite matrix and is a parameter space. If is a probability distribution supported on , then
where and denotes the -divergence defined in Section 1.1.
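For a Gaussian location family with identity covariance and unit noise level, the Ingster-Suslina identity specializes to 1 + χ²(P̄_π ∥ P_0) = E[exp(⟨θ, θ′⟩)], where θ, θ′ are independent draws from the prior. The Monte Carlo sketch below evaluates this second moment for a sparse Rademacher-type prior; the dimension, sparsity, and magnitude are hypothetical, and the unit noise scaling is an assumption on our part.

```python
import numpy as np

rng = np.random.default_rng(4)
p, s, rho, n_mc = 500, 5, 0.4, 20_000    # dimension, sparsity, magnitude, Monte Carlo size

def draw_theta() -> np.ndarray:
    """One draw from a sparse prior: uniform size-s support with independent signs."""
    theta = np.zeros(p)
    S = rng.choice(p, size=s, replace=False)
    theta[S] = rho * rng.choice([-1.0, 1.0], size=s)
    return theta

# 1 + chi^2(P_bar || P_0) = E[exp(<theta, theta'>)] for independent prior draws (unit noise)
vals = np.array([np.exp(draw_theta() @ draw_theta()) for _ in range(n_mc)])
print("estimated 1 + chi-square divergence:", vals.mean())
```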
Lemma 27 ([45]).
If are probability measures on a measurable space with , then .
Lemma 28.
If is distributed according to the hypergeometric distribution with probability mass function for , then for .
Corollary 3 ([10]).
If is distributed according to the hypergeometric distribution with probability mass function for , then and for .
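The hypergeometric distribution enters because the overlap |S ∩ S′| of two independent, uniformly drawn size-s subsets of {1, …, p} is hypergeometric; this is the mechanism behind the second-moment calculations in the lower bound proofs, and it is easy to confirm by simulation (p and s below are arbitrary).

```python
import numpy as np
from scipy.stats import hypergeom

rng = np.random.default_rng(5)
p, s, n_mc = 30, 6, 50_000

overlaps = np.array([
    len(set(rng.choice(p, size=s, replace=False)) & set(rng.choice(p, size=s, replace=False)))
    for _ in range(n_mc)
])

# Empirical overlap frequencies versus the hypergeometric(p, s, s) pmf
for k in range(s + 1):
    print(k, round(float(np.mean(overlaps == k)), 4), round(hypergeom.pmf(k, p, s, s), 4))
```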