High-dimensional Asymptotic Theory of Bayesian Multiple Testing Procedures Under General Dependent Setup and Possible Misspecification
Abstract
In this article, we investigate the asymptotic properties of Bayesian multiple testing procedures under a general dependent setup, when the sample size and the number of hypotheses both tend to infinity. Specifically, we investigate strong consistency of the procedures and asymptotic properties of different versions of false discovery and false non-discovery rates under the high-dimensional setup. We particularly focus on a novel Bayesian non-marginal multiple testing procedure and its associated error rates in this regard. Our results show that the asymptotic convergence rates of the error rates are directly associated with the Kullback-Leibler divergence from the true model, and the results hold even when the postulated class of models is misspecified.
For illustration of our high-dimensional asymptotic theory, we consider a Bayesian variable selection problem in a time-varying covariate selection framework, with autoregressive response variables. We particularly focus on the setup where the number of hypotheses increases faster than the sample size, the so-called ultra high-dimensional situation.
MSC 2010 subject classifications: Primary 62F05, 62F15; secondary 62C10, 62J07.
Keywords: Bayesian multiple testing, Dependence, False discovery rate, Kullback-Leibler, Posterior convergence, Ultra high dimension.
1 Introduction
The area of multiple hypothesis testing has gained much importance and popularity, particularly in this era of big data, where very large numbers of hypotheses often need to be tested simultaneously. Applications abound in the fields of statistical genetics, spatio-temporal statistics and brain imaging, to name a few. On the theoretical side, it is important to establish the validity of a multiple testing procedure in the sense that the procedure controls the false discovery rate at some pre-specified level, or attains the oracle property, as the number of tests grows to infinity.
Although there is considerable literature addressing these issues, the important factor of dependence among the tests seems to have received less attention. Indeed, realistically, the test statistics or the parameters cannot be expected to be independent. In this regard, Chandra and Bhattacharya (2019) introduced a novel Bayesian multiple testing procedure that coherently accounts for such dependence and yields joint decision rules that are functions of appropriate joint posterior probabilities. As demonstrated in Chandra and Bhattacharya (2019) and Chandra and Bhattacharya (2020), the new Bayesian method significantly outperforms existing popular multiple testing methods by proper utilization of the dependence structures. Since in the new method the decisions are taken jointly, the method is referred to as the Bayesian non-marginal multiple testing procedure.
Chandra and Bhattacharya (2020) investigated in detail the asymptotic theory of the non-marginal procedure, and indeed of general Bayesian multiple testing methods under additive loss, for a fixed number of hypotheses, when the sample size tends to infinity. In particular, they provided sufficient conditions for strong consistency of such procedures and also showed that the asymptotic convergence rates of the versions of the false discovery rate ($FDR$) and the false non-discovery rate ($FNR$) are directly related to the Kullback-Leibler (KL) divergence from the true model. Interestingly, their results continue to hold even under misspecification, that is, if the class of postulated models does not include the true model. In this work, we investigate the asymptotic properties of the Bayesian non-marginal procedure in particular, and Bayesian multiple testing methods under additive loss in general, when the sample size, as well as the number of hypotheses, tend to infinity.
As mentioned earlier, asymptotic works in multiple testing when the number of hypotheses grows to infinity are not rare. Xie et al. (2011) proposed an asymptotically optimal decision rule for short range dependent data with dependent test statistics. Bogdan et al. (2011) studied the oracle properties and Bayes risk of several multiple testing methods under sparsity in a Bayesian decision-theoretic setup. Datta and Ghosh (2013) studied oracle properties for the horseshoe prior when the number of tests grows to infinity. However, in the aforementioned works, the test statistics are independent and follow the Gaussian distribution. Fan et al. (2012) proposed a method of dealing with correlated test statistics where the covariance structure is known. Their method is based on the principal eigenvalues of the covariance matrix, which they termed principal factors. Using those principal factors, their method dilutes the association between correlated statistics to deal with an arbitrary dependence structure. They also derived an approximately consistent estimator of the false discovery proportion (FDP) in large-scale multiple testing. Fan and Han (2017) extended this work to the case where the dependence structure is unknown. In these approaches, the decision rules are marginal and the test statistics jointly follow the multivariate Gaussian distribution. Chandra and Bhattacharya (2019) argue that when the decision rules corresponding to different hypotheses are marginal, the full potential of the dependence structure is not properly exploited. Results of extensive simulation studies reported in Chandra and Bhattacharya (2019) and Chandra and Bhattacharya (2020), demonstrating superior performance of the Bayesian non-marginal method compared to popular marginal methods, even for large numbers of hypotheses, support this point. This makes asymptotic analysis of the Bayesian non-marginal method with increasing number of hypotheses all the more important.
To be more specific, we investigate the asymptotic theory of the Bayesian non-marginal procedure in the general dependence setup, without any particular model assumption, when the sample size ($n$) and the number of hypotheses ($m_n$, which may be a function of $n$) both tend to infinity. We establish strong consistency of the procedure and show that even in this setup, the convergence rates of versions of $FDR$ and $FNR$ are directly related to the KL-divergence from the true model. We show that our results continue to hold for general Bayesian procedures under the additive loss function. In the Bayesian non-marginal context we illustrate the theory with the time-varying covariate selection problem, where the number of covariates $m_n$ tends to infinity with the sample size $n$. We distinguish between two setups: the ultra high-dimensional case, that is, where $\frac{m_n}{n} \rightarrow \infty$ (or some positive constant), as $n \rightarrow \infty$, and the high-dimensional but not ultra high-dimensional case, that is, $m_n \rightarrow \infty$ and $\frac{m_n}{n} \rightarrow 0$, as $n \rightarrow \infty$. We particularly focus on the ultra high-dimensional setup because of its much more challenging nature.
2 A brief overview of the Bayesian non-marginal procedure
Let $\boldsymbol{X}_n = \{X_1, \ldots, X_n\}$ denote the available data set. Suppose the data are modelled by the family of distributions $\left\{f_\theta : \theta \in \Theta\right\}$ (which may also be non-parametric). For $i \geq 1$, let us denote by $\Theta_i$ the relevant parameter space associated with the parameter $\theta_i$, where we allow the number of parameters to be infinite as well. Let $\pi(\cdot \mid \boldsymbol X_n)$ and $E(\cdot \mid \boldsymbol X_n)$ denote the posterior distribution and expectation, respectively, given $\boldsymbol X_n$, and let $P$ and $E$ denote the marginal distribution and expectation of $\boldsymbol X_n$, respectively. Let us consider the problem of testing $m_n$ hypotheses simultaneously, corresponding to the actual parameters of interest $\theta_1, \ldots, \theta_{m_n}$, where $m_n \rightarrow \infty$ as $n \rightarrow \infty$.
Without loss of generality, let us consider testing the parameters associated with $\theta_i$; $i = 1, \ldots, m_n$, formalized as:
$$H_{0i}: \theta_i \in \Theta_{0i} \quad \text{versus} \quad H_{1i}: \theta_i \in \Theta_{1i},$$
where $\Theta_{0i}$ and $\Theta_{1i}$ constitute a partition of $\Theta_i$, for $i = 1, \ldots, m_n$.
Let
$$d_i = \begin{cases} 1 & \text{if } H_{1i} \text{ is accepted;}\\ 0 & \text{otherwise,}\end{cases} \tag{2.1}$$
$$r_i = \begin{cases} 1 & \text{if } H_{1i} \text{ is true;}\\ 0 & \text{otherwise.}\end{cases} \tag{2.2}$$
Following Chandra and Bhattacharya (2019), we define $G_i \subseteq \{1, \ldots, m_n\}$ to be the set of hypotheses, including the $i$-th one, which are highly dependent, and define
$$z_i = z_i(\boldsymbol d) = I\left(\bigcap_{j \in G_i} \left\{d_j = r_j\right\}\right), \tag{2.3}$$
the indicator that all decisions within the group $G_i$ are correct.
If, for any $i$, $G_i = \{i\}$, a singleton, then we define $z_i = I\left(d_i = r_i\right)$. Chandra and Bhattacharya (2019) maximize the posterior expectation of the number of true positives
$$TP = \sum_{i=1}^{m_n} d_i z_i, \tag{2.4}$$
subject to controlling the posterior expectation of the error term
$$E = \sum_{i=1}^{m_n} d_i \left(1 - z_i\right), \tag{2.5}$$
which is actually the posterior mean of the sum of three error terms. For a detailed discussion regarding these, see Chandra and Bhattacharya (2019).
It follows that the optimal decision configuration can be obtained by minimizing the function
$$-\sum_{i=1}^{m_n} d_i w_i(\boldsymbol d) + \lambda_n \sum_{i=1}^{m_n} d_i \left(1 - w_i(\boldsymbol d)\right)$$
with respect to all possible decision configurations of the form $\boldsymbol d = (d_1, \ldots, d_{m_n})$, where $d_i \in \{0, 1\}$, $\lambda_n > 0$, and
$$w_i(\boldsymbol d) = \pi\left(\bigcap_{j \in G_i} \left\{d_j = r_j\right\} \,\middle|\, \boldsymbol X_n\right) \tag{2.6}$$
is the posterior probability of the decision configuration being correct for the $i$-th group. Letting $\beta_n = \frac{\lambda_n}{1 + \lambda_n}$, one can equivalently maximize
$$\xi(\boldsymbol d) = \sum_{i=1}^{m_n} d_i \left\{w_i(\boldsymbol d) - \beta_n\right\} \tag{2.7}$$
with respect to $\boldsymbol d$ and obtain the optimal decision configuration.
Definition 1
Let $\mathbb D_{m_n}$ be the set of all $m_n$-dimensional binary vectors denoting all possible decision configurations. Define
$$\widehat{\boldsymbol d} = \underset{\boldsymbol d \in \mathbb D_{m_n}}{\arg\max}\ \xi(\boldsymbol d),$$
where $\xi(\boldsymbol d)$ is given by (2.7). Then $\widehat{\boldsymbol d}$ is the optimal decision configuration obtained as the solution of the non-marginal multiple testing method.
For detailed discussion regarding the choice of the groups $G_i$ in (2.3), see Chandra and Bhattacharya (2019) and Chandra and Bhattacharya (2020). In particular, Chandra and Bhattacharya (2020) show that asymptotically, the Bayesian non-marginal method is robust with respect to the $G_i$'s in the sense that it is consistent with respect to any choice of the grouping structure. As will be shown in this article, the same holds even in the high-dimensional asymptotic setup.
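For concreteness, the following Python sketch illustrates how the optimal decision configuration of Definition 1 could be computed. It is a minimal illustration under our own assumptions, not the authors' implementation: $w_i(\boldsymbol d)$ is estimated by Monte Carlo from hypothetical posterior draws of $\boldsymbol r$, and $\xi(\boldsymbol d)$ is maximized by brute force, which is feasible only for small $m_n$.

```python
import itertools
import numpy as np

def xi(d, post_r, groups, beta):
    """Objective xi(d) = sum_i d_i * (w_i(d) - beta), where w_i(d), the posterior
    probability that all decisions within group G_i are correct, is estimated by
    the proportion of posterior draws r satisfying d_j = r_j for all j in G_i."""
    total = 0.0
    for i, G in enumerate(groups):
        if d[i] == 1:
            w_i = np.mean(np.all(post_r[:, G] == d[G], axis=1))
            total += w_i - beta
    return total

def nonmarginal_decision(post_r, groups, beta):
    """Exhaustive maximization of xi over all 2^m decision configurations;
    the general problem is combinatorial, so this is for illustration only."""
    m = post_r.shape[1]
    best_d, best_val = None, -np.inf
    for bits in itertools.product([0, 1], repeat=m):
        d = np.array(bits)
        val = xi(d, post_r, groups, beta)
        if val > best_val:
            best_d, best_val = d, val
    return best_d

# Toy illustration: 5 hypotheses, 1000 hypothetical posterior draws of r.
rng = np.random.default_rng(0)
post_r = (rng.random((1000, 5)) < np.array([0.9, 0.8, 0.2, 0.1, 0.7])).astype(int)
groups = [[0, 1], [0, 1, 2], [2, 3], [3, 4], [4, 0]]  # each G_i contains i
print(nonmarginal_decision(post_r, groups, beta=0.5))
```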
2.1 Error measures in multiple testing
Storey (2003) advocated the positive False Discovery Rate ($pFDR$) as a measure of Type-I error in multiple testing. Let $\delta(\boldsymbol d \mid \boldsymbol X_n)$ be the probability of choosing $\boldsymbol d$ as the optimal decision configuration given data $\boldsymbol X_n$ when a multiple testing method is employed. Then $pFDR$ is defined as:
$$pFDR = E\left[\frac{\sum_{i=1}^{m_n} d_i \left(1 - r_i\right)}{\sum_{i=1}^{m_n} d_i} \,\middle|\, \sum_{i=1}^{m_n} d_i > 0\right]. \tag{2.8}$$
Analogous to Type-II error, the positive False Non-discovery Rate ($pFNR$) is defined as
$$pFNR = E\left[\frac{\sum_{i=1}^{m_n} \left(1 - d_i\right) r_i}{\sum_{i=1}^{m_n} \left(1 - d_i\right)} \,\middle|\, \sum_{i=1}^{m_n} \left(1 - d_i\right) > 0\right]. \tag{2.9}$$
Under the prior $\pi$, Sarkar et al. (2008) defined the posterior $FDR$ and $FNR$. The measures are given as follows:
$$FDR_{\boldsymbol X_n} = \frac{\sum_{i=1}^{m_n} d_i \left(1 - v_i\right)}{\max\left\{\sum_{i=1}^{m_n} d_i,\ 1\right\}}, \tag{2.10}$$
$$FNR_{\boldsymbol X_n} = \frac{\sum_{i=1}^{m_n} \left(1 - d_i\right) v_i}{\max\left\{\sum_{i=1}^{m_n} \left(1 - d_i\right),\ 1\right\}}, \tag{2.11}$$
where $v_i = \pi\left(\theta_i \in \Theta_{1i} \mid \boldsymbol X_n\right)$. Also, under any non-randomized decision rule, $\delta(\boldsymbol d \mid \boldsymbol X_n)$ is either 1 or 0 depending on the data $\boldsymbol X_n$. Given $\boldsymbol X_n$, we denote these posterior error measures by $FDR_{\boldsymbol X_n}$ and $FNR_{\boldsymbol X_n}$, respectively.
With respect to the new notions of errors in (2.4) and (2.5), Chandra and Bhattacharya (2019) modified $FDR_{\boldsymbol X_n}$ as
$$mFDR_{\boldsymbol X_n} = \frac{\sum_{i=1}^{m_n} d_i \left(1 - w_i(\boldsymbol d)\right)}{\max\left\{\sum_{i=1}^{m_n} d_i,\ 1\right\}}, \tag{2.12}$$
and $FNR_{\boldsymbol X_n}$ as
$$mFNR_{\boldsymbol X_n} = \frac{\sum_{i=1}^{m_n} \left(1 - d_i\right) \left(1 - w_i(\boldsymbol d)\right)}{\max\left\{\sum_{i=1}^{m_n} \left(1 - d_i\right),\ 1\right\}}. \tag{2.13}$$
Notably, the expectations of $FDR_{\boldsymbol X_n}$ and $FNR_{\boldsymbol X_n}$ with respect to $\boldsymbol X_n$, conditioned on the fact that their respective denominators are positive, yield the positive Bayesian $FDR$ ($pBFDR$) and $FNR$ ($pBFNR$), respectively. The same expectations of $mFDR_{\boldsymbol X_n}$ and $mFNR_{\boldsymbol X_n}$ yield the modified positive Bayesian $FDR$ ($mpBFDR$) and the modified positive Bayesian $FNR$ ($mpBFNR$), respectively.
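To make the conditional error measures concrete, the following Python sketch computes Monte Carlo versions of $FDR_{\boldsymbol X_n}$, $FNR_{\boldsymbol X_n}$, $mFDR_{\boldsymbol X_n}$ and $mFNR_{\boldsymbol X_n}$, assuming the forms (2.10)–(2.13) reconstructed above; the estimation of $v_i$ and $w_i(\boldsymbol d)$ by sample proportions from posterior draws of $\boldsymbol r$ is our illustrative choice.

```python
import numpy as np

def posterior_error_measures(d, post_r, groups):
    """Monte Carlo estimates of the conditional error measures, assuming
    FDR_X = sum_i d_i (1 - v_i) / max(sum_i d_i, 1) and
    mFDR_X = sum_i d_i (1 - w_i(d)) / max(sum_i d_i, 1), with analogous
    FNR versions; v_i and w_i(d) are estimated from posterior draws of r."""
    d = np.asarray(d)
    v = post_r.mean(axis=0)                       # v_i = P(H_{1i} true | X_n)
    w = np.array([np.mean(np.all(post_r[:, G] == d[G], axis=1))
                  for G in groups])               # w_i(d), group-wise
    nd, na = max(d.sum(), 1), max((1 - d).sum(), 1)
    return {"FDR_X":  np.sum(d * (1 - v)) / nd,
            "FNR_X":  np.sum((1 - d) * v) / na,
            "mFDR_X": np.sum(d * (1 - w)) / nd,
            "mFNR_X": np.sum((1 - d) * (1 - w)) / na}
```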
Müller et al. (2004) (see also Sun and Cai (2009); Xie et al. (2011)) considered the following additive loss function:
$$L(\boldsymbol d, \boldsymbol\theta) = \sum_{i=1}^{m_n} \left\{c\, d_i \left(1 - r_i\right) + \left(1 - d_i\right) r_i\right\}, \tag{2.14}$$
where $c$ is a positive constant. The decision rule that minimizes the posterior risk of the above loss is $\widehat d_i = I\left(v_i > \frac{c}{1+c}\right)$ for all $i$, where $I$ is the indicator function. Observe that the non-marginal method boils down to this additive loss function based approach when $G_i = \{i\}$ for all $i$, that is, when the information regarding dependence between the hypotheses is not available or is overlooked. Hence, the convergence properties of the additive loss function based methods can be easily derived from our theories.
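The Bayes rule under (2.14) admits a one-line implementation. The sketch below assumes the thresholding form $\widehat d_i = I\left(v_i > \frac{c}{1+c}\right)$ stated above, with $v_i$ estimated from posterior draws; when all groups are singletons, it coincides with the non-marginal rule of Definition 1 with $\beta_n = \frac{c}{1+c}$.

```python
import numpy as np

def additive_loss_decision(post_r, c=1.0):
    """Bayes rule under the additive loss (2.14): reject H_{0i} exactly when
    v_i = P(H_{1i} | X_n), estimated by the posterior draw proportion,
    exceeds c / (1 + c)."""
    v = post_r.mean(axis=0)
    return (v > c / (1.0 + c)).astype(int)
```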
Note that multiple testing problems can be regarded as model selection problems where the task is to choose the correct specification for the parameters under consideration. The model is misspecified even if one decision is taken incorrectly. Under quite general conditions, Shalizi (2009) investigated asymptotic behaviour of misspecified models. We adopt his basic assumptions and some of his convergence results to build a general asymptotic theory for our Bayesian non-marginal multiple testing method in high dimensions.
In Section 3, we provide the setup, assumptions and the main result of Shalizi (2009) which we adopt for our purpose. In Section 4 we address consistency of the Bayesian non-marginal method and convergence of the associated error terms in the high-dimensional setup. High-dimensional asymptotic analyses of versions of $FDR$ and $FNR$ are detailed in Sections 5 and 6, respectively. In Section 7, we establish the high-dimensional asymptotic theory for $FNR$ when versions of $FDR$ are $\alpha$-controlled asymptotically. We illustrate the asymptotic properties of the non-marginal method in a multiple testing setup associated with an autoregressive model involving time-varying covariates in Section 8, in high-dimensional contexts. Finally, in Section 9 we summarize our contributions and provide concluding remarks.
3 Preliminaries for ensuring posterior convergence under general setup
Following Shalizi (2009), we consider a probability space $(\Omega, \mathcal F, P)$, and a sequence of random variables $X_1, X_2, \ldots$, taking values in some measurable space $(\Xi, \mathcal X)$, whose infinite-dimensional distribution is $P$. The natural filtration of this process is $\sigma(\boldsymbol X_n)$, where $\boldsymbol X_n = (X_1, \ldots, X_n)$.
We denote by $F_\theta$ the distributions of processes adapted to $\sigma(\boldsymbol X_n)$, where $\theta$ is associated with a measurable space $(\Theta, \mathcal T)$, and $\Theta$ is generally infinite-dimensional. For the sake of convenience, we assume, as in Shalizi (2009), that $P$ and all the $F_\theta$ are dominated by a common reference measure, with respective densities $p$ and $f_\theta$. The usual assumptions that $P \in \{F_\theta : \theta \in \Theta\}$, or even that $P$ lies in the support of the prior on $\Theta$, are not required for Shalizi's result, rendering it very general indeed. We put the prior distribution $\pi(\cdot)$ on the parameter space $\Theta$.
3.1 Assumptions and theorem of Shalizi
(S1) Consider the following likelihood ratio:
$$R_n(\theta) = \frac{f_\theta(\boldsymbol X_n)}{p(\boldsymbol X_n)}. \tag{3.1}$$
Assume that $R_n(\theta)$ is $\sigma(\boldsymbol X_n) \times \mathcal T$-measurable for all $n > 0$.

(S2) For each $\theta \in \Theta$, the generalized or relative asymptotic equipartition property holds, and so, almost surely,
$$\lim_{n\rightarrow\infty} \frac{1}{n} \log R_n(\theta) = -h(\theta),$$
where $h(\theta)$ is given in (S3) below.

(S3) For every $\theta \in \Theta$, the KL-divergence rate
$$h(\theta) = \lim_{n\rightarrow\infty} \frac{1}{n} E\left(\log \frac{p(\boldsymbol X_n)}{f_\theta(\boldsymbol X_n)}\right) \tag{3.2}$$
exists (possibly being infinite) and is $\mathcal T$-measurable.

(S4) Let $I = \{\theta : h(\theta) = \infty\}$. The prior satisfies $\pi(I) < 1$.

Following the notation of Shalizi (2009), for $A \subseteq \Theta$, let
$$h(A) = \underset{\theta \in A}{\operatorname{ess\,inf}}\ h(\theta), \tag{3.3}$$
$$J(\theta) = h(\theta) - h(\Theta), \tag{3.4}$$
$$J(A) = \underset{\theta \in A}{\operatorname{ess\,inf}}\ J(\theta). \tag{3.5}$$

(S5) There exists a sequence of sets $\mathcal G_n \rightarrow \Theta$ as $n \rightarrow \infty$ such that:
(1)
$$\pi\left(\mathcal G_n\right) \geq 1 - \alpha \exp\left(-\varsigma n\right), \text{ for some } \alpha > 0,\ \varsigma > 2 h(\Theta); \tag{3.6}$$
(2) the convergence in (S3) is uniform in $\theta$ over $\mathcal G_n \setminus I$;
(3) $h\left(\mathcal G_n\right) \rightarrow h(\Theta)$, as $n \rightarrow \infty$.

For each measurable $A \subseteq \Theta$, for every $\delta > 0$, there exists a random natural number $\tau(A, \delta)$ such that
$$\frac{1}{n} \log \int_A R_n(\theta)\, \pi(\theta)\, d\theta \leq \delta + \limsup_{n\rightarrow\infty} \frac{1}{n} \log \int_A R_n(\theta)\, \pi(\theta)\, d\theta, \tag{3.7}$$
for all $n > \tau(A, \delta)$, provided $\limsup_{n\rightarrow\infty} \frac{1}{n} \log \pi\left(\mathbb I_A R_n\right) < \infty$. Regarding this, the following assumption has been made by Shalizi:

(S6) The sets $\mathcal G_n$ of (S5) can be chosen such that for every $\delta > 0$, the inequality $n > \tau\left(\mathcal G_n, \delta\right)$ holds almost surely for all sufficiently large $n$.

(S7) The sets $\mathcal G_n$ of (S5) and (S6) can be chosen such that for any set $A$ with $\pi(A) > 0$,
$$h\left(\mathcal G_n \cap A\right) \rightarrow h(A), \tag{3.8}$$
as $n \rightarrow \infty$.
Under the above assumptions, the following version of the theorem of Shalizi (2009) can be seen to hold.
Theorem 2 ((Shalizi, 2009))
Consider assumptions (S1)–(S7) and any set $A \in \mathcal T$ with $\pi(A) > 0$. Then,
$$\lim_{n\rightarrow\infty} \frac{1}{n} \log \pi\left(A \mid \boldsymbol X_n\right) = -J(A). \tag{3.9}$$
4 Consistency of multiple testing procedures when the number of hypotheses tends to infinity
In this section we show that the non-marginal procedure is asymptotically consistent under any general dependency model satisfying the conditions in Section 3.1. Since one of our main goals is to allow for misspecification, we must define consistency of multiple testing methods encompassing misspecification, while also allowing the number of hypotheses $m_n$ to tend to infinity, where $\frac{m_n}{n}$ may tend to some positive constant or to infinity. We formalize this below by introducing appropriate notions.
4.1 Consistency of multiple testing procedures under misspecification
Let $\Theta = \prod_{i=1}^\infty \Theta_i$ be the infinite-dimensional parameter space of the countably infinite set of parameters $(\theta_1, \theta_2, \ldots)$, where $\prod$ denotes cartesian product. In this case, any decision configuration is also an infinite-dimensional vector of 0's and 1's. Define $\tilde\Theta = \prod_{i=1}^\infty \Theta_{d^t_i, i}$, where $\boldsymbol d^t = (d^t_1, d^t_2, \ldots)$ denotes the actual infinite-dimensional decision configuration satisfying $h(\tilde\Theta) = h(\Theta)$. This definition of $\boldsymbol d^t$ accounts for misspecification in the sense that $\tilde\Theta$ contains the minimizer of the KL-divergence from the true data-generating model. For any decision configuration $\boldsymbol d$, let $\boldsymbol d_{m_n}$ denote the first $m_n$ components of $\boldsymbol d$. Let $\mathbb D_{m_n}$ denote the set of all possible decision configurations corresponding to $m_n$ hypotheses.
Definition 3
Let $\boldsymbol d^t_{m_n}$ be the true decision configuration among all possible decision configurations in $\mathbb D_{m_n}$. Then a multiple testing method yielding the decision configuration $\widehat{\boldsymbol d}$ is said to be asymptotically consistent if, almost surely,
$$\lim_{n\rightarrow\infty} \delta\left(\widehat{\boldsymbol d} = \boldsymbol d^t_{m_n} \,\middle|\, \boldsymbol X_n\right) = 1. \tag{4.1}$$
Recall the constant $\beta_n$ in (2.7), which arises from the constant $\lambda_n$ penalizing the error term $E$ relative to the true positives $TP$. For consistency of the non-marginal procedure, we need certain conditions on the sequence $\{\beta_n\}$, which we state below. These conditions will also play important roles in the asymptotic studies of the different versions of $FDR$ and $FNR$ that we consider.
(A1) We assume that the sequence $\{\beta_n\}$ is neither too small nor too large, that is,
$$\liminf_{n\rightarrow\infty} \beta_n > 0, \tag{4.2}$$
$$\limsup_{n\rightarrow\infty} \beta_n < 1. \tag{4.3}$$

(A2) We assume that neither all the null hypotheses are true, nor all of them are false, for the $m_n$ hypotheses being considered; that is, $\boldsymbol d^t_{m_n} \neq \boldsymbol 0$ and $\boldsymbol d^t_{m_n} \neq \boldsymbol 1$, where $\boldsymbol 0$ and $\boldsymbol 1$ are the $m_n$-dimensional vectors of 0's and 1's, respectively.

Condition (A1) is necessary for the asymptotic consistency of both the non-marginal method and the additive loss function based method. It ensures that the penalizing constant $\beta_n$ is asymptotically bounded away from 0 and 1, that is, neither too small nor too large; for instance, the constant choice $\beta_n = 1/2$ satisfies (A1). Notably, (A2) is not required for the consistency results. The role of (A2) is to ensure that the denominator terms in the multiple testing error measures (defined in Section 2.1) do not become 0.
4.2 Main results on consistency in the infinite-dimensional setup
In this section we investigate the asymptotic properties of the Bayesian non-marginal method and that of Müller et al. (2004) when $\frac{m_n}{n}$ tends to infinity or to some positive constant. It is to be noted that result (3.9) of Shalizi (2009) holds even for infinite-dimensional parameter spaces. Exploiting this fact, we derive the results in this section.
Note that if there exists a value of $\theta$ that minimizes the KL-divergence rate, then it is in the set $\tilde\Theta$. Let us denote by $\tilde\Theta^c$ the complement of $\tilde\Theta$. Observe that if the minimizer lies in the interior of $\tilde\Theta$, then $J(\tilde\Theta^c) > 0$. It then holds that
$$\lim_{n\rightarrow\infty} \frac{1}{n} \log \pi\left(\tilde\Theta^c \,\middle|\, \boldsymbol X_n\right) = -J\left(\tilde\Theta^c\right), \tag{4.4}$$
which implies that for any $0 < \epsilon < J(\tilde\Theta^c)$, there exists a $n_0(\epsilon)$ such that for all $n > n_0(\epsilon)$,
$$\pi\left(\tilde\Theta^c \,\middle|\, \boldsymbol X_n\right) \leq \exp\left\{-n\left(J\left(\tilde\Theta^c\right) - \epsilon\right)\right\}, \tag{4.5}$$
$$1 - \pi\left(\tilde\Theta^c \,\middle|\, \boldsymbol X_n\right) \geq 1 - \exp\left\{-n\left(J\left(\tilde\Theta^c\right) - \epsilon\right)\right\}. \tag{4.6}$$
For notational convenience, we shall henceforth denote $J(\tilde\Theta^c)$ by $J$.
Note that the groups $G_i$ also depend upon $n$ in our setup; hence, we denote them by $G_i$ with this dependence understood. For any decision configuration $\boldsymbol d$ and group $G_i$, let $\boldsymbol d_{G_i} = \{d_j : j \in G_i\}$. Define
$$\mathbb D_i = \left\{\boldsymbol d \in \mathbb D_{m_n} : d_j = d^t_j \text{ for all } j \in G_i\right\}.$$
Here $\mathbb D_i$ is the set of all decision configurations where the decisions corresponding to the hypotheses in $G_i$ are all correct. Clearly $\mathbb D_i$ contains $\boldsymbol d^t_{m_n}$ for all $i$.
Hence, $\mathbb D_i$ is non-empty for every $i$. Observe that if $\boldsymbol d \notin \mathbb D_i$, at least one decision is wrong corresponding to some parameter in $G_i$. As $\pi(\tilde\Theta^c \mid \boldsymbol X_n)$ is the posterior probability of at least one wrong decision in the infinite-dimensional parameter space, we have
$$w_i(\boldsymbol d) \leq \pi\left(\tilde\Theta^c \,\middle|\, \boldsymbol X_n\right), \quad \text{for } \boldsymbol d \notin \mathbb D_i. \tag{4.7}$$
Also, if $d^t_i = 1$, then
$$v_i \geq w_i\left(\boldsymbol d^t_{m_n}\right) \geq 1 - \pi\left(\tilde\Theta^c \,\middle|\, \boldsymbol X_n\right). \tag{4.8}$$
Similarly, for $\boldsymbol d \in \mathbb D_i$, and for false $H_{1i}$ (that is, $d^t_i = 0$),
$$w_i(\boldsymbol d) \geq 1 - \pi\left(\tilde\Theta^c \,\middle|\, \boldsymbol X_n\right), \tag{4.9}$$
$$1 - v_i \geq 1 - \pi\left(\tilde\Theta^c \,\middle|\, \boldsymbol X_n\right). \tag{4.10}$$
It is important to note that the inequalities (4.7)–(4.10) hold for all $i$, and the bound $\pi(\tilde\Theta^c \mid \boldsymbol X_n)$ is the same for all $m_n$, thanks to the validity of Shalizi's result in the infinite-dimensional parameter space. Exploiting the properties of Shalizi's theorem, we will now establish consistency of the Bayesian non-marginal method for an increasing number of hypotheses.
Theorem 4
Let $\widehat{\boldsymbol d}$ denote the decision rule corresponding to the Bayesian non-marginal procedure for $m_n$ hypotheses being tested using samples of size $n$, where $m_n \rightarrow \infty$ as $n \rightarrow \infty$. Assume Shalizi's conditions and assumption (A1). Also assume that $J = J(\tilde\Theta^c) > 0$. Then,
$$\delta\left(\widehat{\boldsymbol d} = \boldsymbol d^t_{m_n} \,\middle|\, \boldsymbol X_n\right) \rightarrow 1, \text{ almost surely, as } n \rightarrow \infty; \tag{4.11}$$
$$E\left[\delta\left(\widehat{\boldsymbol d} = \boldsymbol d^t_{m_n} \,\middle|\, \boldsymbol X_n\right)\right] \rightarrow 1, \text{ as } n \rightarrow \infty. \tag{4.12}$$
Corollary 5
Under the setup and assumptions of Theorem 4, the additive loss function based method with decision rule $\widehat d_i = I\left(v_i > \beta_n\right)$ is also asymptotically consistent.
Remark 6
Note that Theorem 4 does not require any condition regarding the growth of $m_n$ with respect to $n$, and holds if $\frac{m_n}{n} \rightarrow c$ as $n \rightarrow \infty$, where $c$ is some constant, or infinity. Thus, the result seems to be extremely satisfactory. However, restrictions on the growth of $m_n$ generally need to be imposed to satisfy the conditions of Shalizi. An illustration in this regard is provided in Section 8.
5 High-dimensional asymptotic analyses of versions of $FDR$
For a fixed number of hypotheses, Chandra and Bhattacharya (2020) investigated convergence of different versions of $FDR$ as the sample size tends to infinity. They show that the convergence rates of the posterior error measures $mFDR_{\boldsymbol X_n}$ and $FDR_{\boldsymbol X_n}$ are directly associated with the KL-divergence from the true model. Indeed, they were able to obtain the exact limits of $\frac{1}{n}\log mFDR_{\boldsymbol X_n}$ and $\frac{1}{n}\log FDR_{\boldsymbol X_n}$ in terms of the relevant finite-dimensional KL-divergence rate.
In the current high-dimensional setup, however, such an exact KL-divergence rate cannot be expected to be available, since the number of hypotheses is not fixed. As $m_n \rightarrow \infty$, it is plausible to expect that the convergence rates depend upon the infinite-dimensional KL-divergence rate $J$. We show that this is indeed the case, but the exact limit is not available, which is again to be expected, since $m_n$ approaches infinity but never equals infinity. Here, in the high-dimensional setup, we obtain $-J$ as an upper bound of the relevant limit supremums. It is easy to observe that the corresponding limits in the finite-dimensional setup are also bounded above by $-J$, thus providing evidence of internal consistency as we move from the fixed-dimensional to the high-dimensional setup.
We also show that $mpBFDR$ and $pBFDR$ approach zero, even though the rates of convergence are not available. Recall that even in the fixed-dimensional setup, the convergence rates of $mpBFDR$ and $pBFDR$ were not available. As in the consistency result, these results too do not require any restriction on the growth rate of $m_n$, except that required for Shalizi's conditions to hold.
We present our results below, the proofs of which are presented in the supplement.
Theorem 7
Assume the setup and conditions of Theorem 4. Then, for any $0 < \epsilon < J$, there exists $n_0(\epsilon) \geq 1$ such that for $n \geq n_0(\epsilon)$, the following hold almost surely:
$$mFDR_{\boldsymbol X_n} \leq \exp\left\{-n\left(J - \epsilon\right)\right\}; \tag{5.1}$$
$$FDR_{\boldsymbol X_n} \leq \exp\left\{-n\left(J - \epsilon\right)\right\}. \tag{5.2}$$
The above theorem shows that the convergence of $mFDR_{\boldsymbol X_n}$ and $FDR_{\boldsymbol X_n}$ to 0 for an arbitrarily large number of hypotheses occurs at an exponential rate, for arbitrary growth rate of $m_n$ with respect to $n$. However, Shalizi's conditions would again require restriction on the growth rate of $m_n$.
Corollary 8
Under the setup and assumptions of Theorem 4, almost surely,
$$\limsup_{n\rightarrow\infty} \frac{1}{n} \log mFDR_{\boldsymbol X_n} \leq -J; \tag{5.3}$$
$$\limsup_{n\rightarrow\infty} \frac{1}{n} \log FDR_{\boldsymbol X_n} \leq -J. \tag{5.4}$$
6 High-dimensional asymptotic analyses of versions of $FNR$
High-dimensional asymptotic treatments of versions of $FNR$ are similar to those for versions of $FDR$. In particular, the limit supremums of $\frac{1}{n}\log mFNR_{\boldsymbol X_n}$ and $\frac{1}{n}\log FNR_{\boldsymbol X_n}$ are both bounded above by $-J$, and both $mpBFNR$ and $pBFNR$ converge to zero. The proofs of these results are also similar to those for the respective $FDR$ versions. Internal consistency of these results is again evident, since the limits of $\frac{1}{n}\log mFNR_{\boldsymbol X_n}$ and $\frac{1}{n}\log FNR_{\boldsymbol X_n}$ in the finite-dimensional setup are bounded above by $-J$, while $mpBFNR$ and $pBFNR$ converge to zero for a fixed number of hypotheses. In the latter cases, convergence rates are not available in either the fixed- or the high-dimensional case. Below we provide the relevant results on versions of $FNR$, with proofs in the supplement.
Theorem 10
Assume the setup and conditions of Theorem 4. Then, for any $0 < \epsilon < J$, there exists $n_0(\epsilon) \geq 1$ such that for $n \geq n_0(\epsilon)$, the following hold almost surely:
$$mFNR_{\boldsymbol X_n} \leq \exp\left\{-n\left(J - \epsilon\right)\right\}; \tag{6.1}$$
$$FNR_{\boldsymbol X_n} \leq \exp\left\{-n\left(J - \epsilon\right)\right\}. \tag{6.2}$$
The above theorem shows that the convergence of $mFNR_{\boldsymbol X_n}$ and $FNR_{\boldsymbol X_n}$ to 0 for an arbitrarily large number of hypotheses occurs at an exponential rate, for arbitrary growth rate of $m_n$ with respect to $n$. However, Shalizi's conditions would again require restriction on the growth rate of $m_n$.
Corollary 11
Under the setup and assumptions of Theorem 4, almost surely,
$$\limsup_{n\rightarrow\infty} \frac{1}{n} \log mFNR_{\boldsymbol X_n} \leq -J; \tag{6.3}$$
$$\limsup_{n\rightarrow\infty} \frac{1}{n} \log FNR_{\boldsymbol X_n} \leq -J. \tag{6.4}$$
7 High-dimensional asymptotics for $FNR$ when versions of $FDR$ are $\alpha$-controlled
It has been proved in Chandra and Bhattacharya (2019) that, for the non-marginal multiple testing procedure and additive loss function based methods, $mpBFDR$ and $pBFDR$ are continuous and non-increasing in $\beta$. Consequently, for suitable values of $\beta$, any error level $\alpha$ within an appropriate range can be achieved by these errors. For suitably chosen positive values of $\alpha$, one can hope to reduce the corresponding $FNR$. This is standard practice even in the single hypothesis testing literature, where the Type-I error is controlled at some positive value so that a reduced Type-II error may be incurred. However, as shown in Chandra and Bhattacharya (2020) in the fixed-dimensional setup, for the non-marginal multiple testing procedure and additive loss function based methods, values of $\alpha$ as close to 1 as desired cannot be attained by versions of $FDR$ as the sample size tends to infinity. This is not surprising, however, since consistent procedures are not expected to incur large errors asymptotically, at least when the number of hypotheses is fixed. Indeed, in the fixed-dimensional setup, Chandra and Bhattacharya (2020) provided an interval of the form $[\alpha_1, \alpha_2]$, where $0 < \alpha_1 < \alpha_2 < 1$, in which the maximum values of the versions of $FDR$ can lie asymptotically, and obtained asymptotic results for $FNR$ for such $\alpha$-controlled versions of $FDR$.
In this section we investigate the asymptotic theory for $\alpha$-control in the high-dimensional context, that is, when $m_n \rightarrow \infty$ as $n \rightarrow \infty$. Although none of our previous high-dimensional results required any explicit restrictions on the growth rate of $m_n$, given that the posterior convergence result of Shalizi holds, here we need the very mild condition that $m_n$ grows slower than exponentially in $n$. We also need to fix the limiting proportion $p$ of true alternatives as $n \rightarrow \infty$, and the limiting proportion $q$ of groups associated with at least one false null hypothesis. As we show, these two proportions define an interval of the form $(0, b]$, with $0 < b < 1$ depending only on $p$ and $q$, in which the maximum of the versions of $FDR$ lies as $n \rightarrow \infty$. In contrast with the fixed-dimensional asymptotics of Chandra and Bhattacharya (2020), the lower bound of the interval is zero in high dimensions, not strictly positive. To explain, for fixed dimension, the lower bound is a ratio of finite counts; replacing these counts with the corresponding high-dimensional quantities, dividing both numerator and denominator by $m_n$ and taking the limit as $n \rightarrow \infty$, the lower bound tends to zero. Similar intuition can be used to verify that the upper bound in the fixed-dimensional case converges to $b$ in the high-dimensional setup. As in our previous results, these provide a verification of internal consistency in the transition from fixed-dimensional to high-dimensional situations.
Our results regarding asymptotic control of versions of $FDR$ and the corresponding convergence of versions of $FNR$ are detailed in Sections 7.1 and 7.2.
7.1 High-dimensional $\alpha$-control of $mpBFDR$ and $pBFDR$ for the non-marginal method
The following theorem provides the interval for the maximum $mpBFDR$ that can be incurred asymptotically in the high-dimensional setup.
Theorem 13
In addition to (A1)–(A2), assume the following:

(B) For each $n$, let each group of a particular set of $m_{1n}$ groups, out of the total $m_n$ groups, be associated with at least one false null hypothesis, and let all the null hypotheses associated with the remaining $m_n - m_{1n}$ groups be true. Let us further assume that the latter groups do not have any overlap with the former groups. Without loss of generality, assume that $G_1, \ldots, G_{m_{1n}}$ are the groups each consisting of at least one false null and $G_{m_{1n}+1}, \ldots, G_{m_n}$ are the groups where all the null hypotheses are true. Assume further the following limits:
$$\frac{1}{m_n} \sum_{i=1}^{m_n} d^t_i \rightarrow p, \tag{7.1}$$
$$\frac{m_{1n}}{m_n} \rightarrow q, \tag{7.2}$$
$$\frac{\log m_n}{n} \rightarrow 0, \tag{7.3}$$
as $n \rightarrow \infty$.

Then the maximum $mpBFDR$ that can be incurred asymptotically lies in the interval $(0, b]$, where $b \in (0, 1)$ is determined by the limiting proportions $p$ and $q$.
Remark 14
If $p$ is close to zero, that is, if all but a finite number of null hypotheses are true, then $b$ is close to its largest value, showing that in such cases better $\alpha$-control can be exercised. Indeed, as the proof of the theorem shows, the optimal decision in this case will be given by all but a finite set of one's, so that all but a finite number of decisions are incorrect; hence, maximum error occurs in this case. Also, if $q$ is close to one, then $b$ is close to zero. In other words, if all but a finite number of groups are associated with at least one false null hypothesis, then almost no error can be incurred. As the proof of Theorem 13 shows, this is the case where all but a finite number of decisions are correct, and hence it is not surprising that almost no error can be incurred in this case.
Remark 15
Also, as in the fixed-dimensional case, Theorem 13 continues to hold if its defining condition holds for at least one $i$; if the condition fails for every $i$, then the maximum $mpBFDR$ tends to zero as $n \rightarrow \infty$, for any sequence $\{\beta_n\}$.
Remark 16
The following theorem shows that for feasible values of $\alpha$ attained asymptotically by the maximum of $mpBFDR$, for appropriate sequences of penalizing constants $\{\beta_n\}$, it is possible to asymptotically approach such $\alpha$ through $mpBFDR(\beta_n)$, where $mpBFDR(\beta_n)$ denotes the $mpBFDR$ for the non-marginal procedure with penalizing constant $\beta_n$.
Theorem 17
From the proofs of Theorems 13 and 17, it can be seen that replacing $mpBFDR$ by $pBFDR$ does not affect the results. Hence we state the following corollary.
Corollary 18
Let $pBFDR(\beta_n)$ denote the $pBFDR$ corresponding to the non-marginal procedure where the penalizing constant is $\beta_n$. Suppose that the conditions of Theorem 17 hold. Then, for any $\alpha$ in the feasible interval, under condition (B), there exists a sequence $\{\beta_n\}$ such that $pBFDR(\beta_n) \rightarrow \alpha$ as $n \rightarrow \infty$.
As in the fixed-dimensional setup, we see that for $\alpha$-control we must have a suitably restricted sequence $\{\beta_n\}$, and that for larger $\beta_n$ the error tends to zero. In other words, even in the high-dimensional setup, $\alpha$-control requires a sequence $\{\beta_n\}$ that is smaller than that for which the error tends to zero.
Since the additive loss function based methods are special cases of the non-marginal procedure where $G_i = \{i\}$ for all $i$ (see Chandra and Bhattacharya (2019), Chandra and Bhattacharya (2020)), and since in such cases $mpBFDR$ reduces to $pBFDR$, it is important to investigate asymptotic $\alpha$-control of $pBFDR$ in this situation. Our result in this direction is provided in Theorem 19.
Theorem 19
Let $m_{0n}$ be the number of true null hypotheses such that $\frac{m_{0n}}{m_n} \rightarrow p_0$, as $n \rightarrow \infty$. Then for any $\alpha$ in the feasible range, there exists a sequence $\{\beta_n\}$ such that, for the additive loss function based methods, the asymptotic $pBFDR$ equals $\alpha$.
The result is similar in spirit to that obtained by Chandra and Bhattacharya (2020) in the corresponding finite-dimensional situation. The limit $p_0$ of $\frac{m_{0n}}{m_n}$ in the high-dimensional setup, instead of the fixed proportion of true nulls in the fixed-dimensional case, plays the central role here.
Chandra and Bhattacharya (2019) and Chandra and Bhattacharya (2020) noted that even for additive loss function based multiple testing procedures, $mpBFDR$ may be a more desirable candidate compared to $pBFDR$, since it can yield non-marginal decisions even if the multiple testing criterion to be optimized is a simple sum of loss functions designed to yield marginal decisions. The following theorem shows that the same high-dimensional asymptotic result as Theorem 19 also holds for $mpBFDR$ in the case of additive loss functions, without the requirement of condition (B). The non-requirement of condition (B), even in the high-dimensional setup, can be attributed to the fact that the modified error measure dominates its unmodified counterpart for any multiple testing method and arbitrary sample size.
Theorem 20
Let $m_{0n}$ be the number of true null hypotheses such that $\frac{m_{0n}}{m_n} \rightarrow p_0$, as $n \rightarrow \infty$. Let $\alpha$ be the desired level of significance, lying in the feasible range. Then there exists a sequence $\{\beta_n\}$ such that, for the additive loss function based method, the asymptotic $mpBFDR$ equals $\alpha$.
Note that Bayesian versions of $FDR$ (conditional on the data) need not be continuous with respect to $\beta$, and so results for such Bayesian versions similar to Theorem 17, Corollary 18 and Theorems 19 and 20, which heavily use this continuity property, could not be established.
Thus, interestingly, all the asymptotic results for $\alpha$-control of versions of $FDR$ in the fixed-dimensional setup admit simple extensions to the high-dimensional setup, with minimal assumptions regarding the growth rate of $m_n$, given that Shalizi's conditions hold. Since Shalizi's conditions are meant for posterior consistency, from the multiple testing perspective our high-dimensional results are very interesting in the sense that almost no extra assumptions, in addition to Shalizi's conditions, are required for our multiple testing results to carry over from fixed dimension to high dimensions.
7.2 High-dimensional properties of Type-II errors when $mpBFDR$ and $pBFDR$ are asymptotically controlled at $\alpha$
In this section, we investigate the high-dimensional asymptotic theory for $mFNR_{\boldsymbol X_n}$ and $mpBFNR$ associated with $\alpha$-control of versions of $FDR$. Our results in these regards are provided as Theorem 21 and Corollary 22.
Theorem 21
Assume condition (B) and that $\frac{\log m_n}{n} \rightarrow 0$, as $n \rightarrow \infty$. Then for asymptotic $\alpha$-control of $mpBFDR$ in the non-marginal procedure, the following holds almost surely:
$$\limsup_{n\rightarrow\infty} \frac{1}{n} \log mFNR_{\boldsymbol X_n} \leq -J.$$
The above theorem requires the very mild assumption that $\frac{\log m_n}{n} \rightarrow 0$ as $n \rightarrow \infty$, in addition to (B). The result shows that $mFNR_{\boldsymbol X_n}$ converges to zero at an exponential rate, but again the exact limit is not available in this high-dimensional setup. This is slightly disconcerting in the sense that we are now unable to compare the rates of convergence of $mFNR_{\boldsymbol X_n}$ for the cases where $\alpha$-control is and is not imposed. Indeed, for the fixed-dimensional setup, Chandra and Bhattacharya (2020) could obtain exact limits and consequently show that $mFNR_{\boldsymbol X_n}$ converges to zero at a rate faster than or equal to that of the case where no control is exercised. However, as we already argued in the context of versions of $FDR$, exact limits are not expected to be available in these cases in high dimensions.
Corollary 22
Assume condition (B) and that $\frac{\log m_n}{n} \rightarrow 0$, as $n \rightarrow \infty$. Then for asymptotic $\alpha$-control of $mpBFDR$ in the non-marginal procedure, $mpBFNR \rightarrow 0$ as $n \rightarrow \infty$.
Thus, as in the fixed-dimensional setup, Corollary 22 shows that the $mpBFNR$ corresponding to $\alpha$-control converges to zero even in the high-dimensional setup, although the rate of convergence to zero is unavailable.
8 Illustration of consistency of our non-marginal multiple testing procedure in time-varying covariate selection in an autoregressive process
Let the true model $P$ stand for the following model consisting of time-varying covariates:
$$x_t = \rho_0 x_{t-1} + \sum_{i=1}^{m_n} \beta_{0i} z_{it} + \epsilon_t;\quad t = 1, 2, \ldots, n, \tag{8.1}$$
where $|\rho_0| < 1$ and $\epsilon_t \stackrel{iid}{\sim} N\left(0, \sigma_0^2\right)$, for $t = 1, 2, \ldots$. In (8.1), $m_n \rightarrow \infty$ as $n \rightarrow \infty$. Here $z_{it}$; $i = 1, \ldots, m_n$, are relevant time-varying covariates. We set $x_0 = 0$ for simplicity.
Now let the data be modelled by the same model as $P$, but with $\rho_0$, $\boldsymbol\beta_0$ and $\sigma_0$ replaced with the unknown quantities $\rho$, $\boldsymbol\beta$ and $\sigma$, respectively; that is,
$$x_t = \rho x_{t-1} + \sum_{i=1}^{m_n} \beta_i z_{it} + \epsilon_t;\quad t = 1, 2, \ldots, n, \tag{8.2}$$
where we set $\epsilon_t \stackrel{iid}{\sim} N\left(0, \sigma^2\right)$ and $x_0 = 0$, as before.
For notational purposes, we let $\boldsymbol X_n = (x_1, \ldots, x_n)'$, $\boldsymbol z_t = (z_{1t}, \ldots, z_{m_n t})'$, $\boldsymbol\beta_0 = (\beta_{01}, \beta_{02}, \ldots)'$ and $\boldsymbol\beta = (\beta_1, \beta_2, \ldots)'$.
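As an illustration of the setup (8.1)–(8.2), the following Python sketch simulates data from the reconstructed true model. The bounded covariate generator, the sparse choice of $\boldsymbol\beta_0$ and all numerical values are our own assumptions, intended only to mimic (B1) below and the sparsity discussed in Section 8.1.

```python
import numpy as np

def simulate_ar_tvc(n, m, rho0=0.5, sigma0=1.0, n_nonzero=5, seed=1):
    """Simulate from a model of the form (8.1) as reconstructed above:
    x_t = rho0 * x_{t-1} + sum_i beta0_i z_{it} + eps_t, with x_0 = 0 and
    only a few nonzero beta0_i (sparsity). The covariates are drawn from a
    bounded process so that sup |z_{it}| < infinity, mimicking (B1)."""
    rng = np.random.default_rng(seed)
    z = np.tanh(rng.normal(size=(m, n)))          # bounded covariates, |z| < 1
    beta0 = np.zeros(m)
    beta0[:n_nonzero] = rng.normal(scale=1.0, size=n_nonzero)
    x = np.zeros(n + 1)
    for t in range(1, n + 1):
        x[t] = rho0 * x[t - 1] + beta0 @ z[:, t - 1] + sigma0 * rng.normal()
    return x[1:], z, beta0

x, z, beta0 = simulate_ar_tvc(n=100, m=500)       # ultra high-dimensional: m >> n
```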
8.1 The ultra high-dimensional setup
Let us first consider the setup where $\frac{m_n}{n} \rightarrow \infty$ as $n \rightarrow \infty$. This is a challenging problem, and we require notions of sparsity to address it. As will be shown subsequently in Section 8.2, a precise notion of sparsity is available for our problem in the context of the equipartition property. Specifically, sparsity in our problem entails controlling relevant quadratic forms of $\boldsymbol\beta$. For such sparsity, we must devise a prior for $\boldsymbol\beta$ such that $\sum_{i=1}^\infty |\beta_i| < \infty$, almost surely. We also assume that $\sum_{i=1}^\infty |\beta_{0i}| < \infty$.
For appropriate prior structures for $\boldsymbol\beta$, let us consider the following strategy. First, let us consider an almost surely continuously differentiable random function $b(\cdot)$ on a compact space $\mathcal Z$, such that
$$\sup_{z \in \mathcal Z} |b(z)| < \infty \quad \text{and} \quad \sup_{z \in \mathcal Z} |b'(z)| < \infty, \quad \text{almost surely}. \tag{8.3}$$
We denote the class of such functions as $\mathcal B$. A popular prior for $b$ is the Gaussian process prior with sufficiently smooth covariance function, in which case both $b$ and $b'$ are Gaussian processes; see, for example, Cramer and Leadbetter (1967). Let us now consider an arbitrary sequence $\{s_1, s_2, \ldots\} \subset \mathcal Z$, and let $\tilde\beta_i = b(s_i)$, where, for $i \geq 1$, $s_i \in \mathcal Z$. We then define $\beta_i = \eta_i \tilde\beta_i$, where, for $i \geq 1$, the $\eta_i$ are independent (but non-identical) random variables, such that $P(\eta_i = 1) = 1 - P(\eta_i = 0) = p_i$ for $0 < p_i < 1$, and
$$\sum_{i=1}^\infty p_i < \infty. \tag{8.4}$$
By the Borel–Cantelli lemma, (8.4) ensures that only finitely many $\beta_i$ are nonzero almost surely, so that $\sum_{i=1}^\infty |\beta_i| < \infty$, almost surely.
Also, let $\boldsymbol\eta = (\eta_1, \eta_2, \ldots)'$ and $\tilde{\boldsymbol\beta} = (\tilde\beta_1, \tilde\beta_2, \ldots)'$. Thus, $\theta = (\boldsymbol\beta, \rho, \sigma)$, where $\sum_{i=1}^\infty |\beta_i| < \infty$, $|\rho| < 1$ and $\sigma > 0$, and the corresponding product space $\Theta$ is the parameter space. For our asymptotic theories regarding the multiple testing methods that we consider, we must verify the assumptions of Shalizi for the modeling setups (8.1) and (8.2), with this parameter space.
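The following Python sketch illustrates one concrete reading of this prior construction: $b$ is a smooth Gaussian-process path on $[0,1]$ as in (8.3), and the switches $\eta_i$ are independent Bernoulli$(p_i)$ variables with $\sum_i p_i < \infty$ as in (8.4), so that only finitely many $\beta_i$ are nonzero almost surely by the Borel–Cantelli lemma. The Bernoulli form and all numerical choices are our assumptions, since the original construction is only partially recoverable from the text.

```python
import numpy as np

def sample_sparse_beta(m, ell=0.2, seed=2):
    """Sample beta_i = eta_i * b(s_i): b is a Gaussian-process path on [0,1]
    with squared-exponential covariance (smooth, almost surely bounded, per
    (8.3)); eta_i ~ Bernoulli(p_i) independently, with p_i = 1/i^2 so that
    sum_i p_i < infinity, per (8.4)."""
    rng = np.random.default_rng(seed)
    s = np.linspace(0.0, 1.0, m)                        # evaluation points s_i
    K = np.exp(-0.5 * (s[:, None] - s[None, :])**2 / ell**2)
    b = rng.multivariate_normal(np.zeros(m), K + 1e-10 * np.eye(m))
    p = 1.0 / np.arange(1, m + 1)**2                    # summable inclusion probs
    eta = rng.random(m) < p
    return eta * b

beta = sample_sparse_beta(1000)
print(np.count_nonzero(beta), np.sum(np.abs(beta)))     # few nonzeros; finite l1 norm
```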
With respect to the above ultra high-dimensional setup, we consider the following multiple testing framework:
$$H_{0i}: \beta_i \in \mathcal N_0 \quad \text{versus} \quad H_{1i}: \beta_i \in \mathcal N_0^c;\quad i = 1, \ldots, m_n, \tag{8.5}$$
where $\mathcal N_0$ is some neighborhood of zero and $\mathcal N_0^c$ is the complement of the neighborhood in the relevant parameter space.
Verification of consistency of our non-marginal procedure amounts to verification of assumptions (S1)–(S7) of Shalizi for the above setup. In this regard, we make the following assumptions:

(B1) $\sup\left\{|z_{it}| : i \geq 1,\ t \geq 1\right\} < \infty$, where $z_{it}$ is the value of the $i$-th covariate at time $t$.

(B2) For $t \neq s$, $\frac{\boldsymbol z_t' \boldsymbol z_s}{m_n} \rightarrow 0$, as $m_n \rightarrow \infty$; that is, the covariate vectors at different time points, after scaling by $m_n$, are asymptotically orthogonal.

(B3) Let $\lambda_n$ be the largest eigenvalue of $\frac{\boldsymbol Z_n' \boldsymbol Z_n}{n}$, where $\boldsymbol Z_n$ is the $n \times m_n$ matrix whose $t$-th row is $\boldsymbol z_t'$. We assume that $\sup_{n \geq 1} \lambda_n < \infty$.

(B4)
$$\frac{1}{n} \sum_{t=1}^n \boldsymbol z_t \rightarrow \boldsymbol 0, \tag{8.6}$$
$$\frac{1}{n} \sum_{t=1}^n \left(\boldsymbol\beta_0' \boldsymbol z_t\right)^2 \rightarrow c_0, \tag{8.7}$$
$$\frac{1}{n} \sum_{t=1}^n \left(\boldsymbol\beta' \boldsymbol z_t\right)^2 \rightarrow c_\beta, \tag{8.8}$$
as $n \rightarrow \infty$. In the above, $c_0$ is a finite constant; the $c_\beta$, for different $\boldsymbol\beta$, are finite quantities that depend upon the choice of the sequence $\{s_i\}$.

(B5) The limits of the relevant moment quantities appearing in the KL-divergence rate calculations, for almost all $(\boldsymbol\beta, \rho, \sigma)$, exist as $n \rightarrow \infty$.

(B6) There exist positive constants $c_1, c_2, c_3, c_4$ and $c_5$ such that the exponential prior tail bounds required for the verification of (S5) in Section S-13.5 hold for sufficiently large $n$.

(B7) The bounding sequences defining the sets $\mathcal G_n$ in Section S-13.5 grow with $n$ in such a way that, for some $n_0 \geq 1$, $\mathcal G_n$ is non-decreasing in $n$ for all $n \geq n_0$.
8.2 Discussion of the assumptions in the light of the ultra high-dimensional setup
Condition (B1) holds if the covariates $z_{it}$; $i = 1, 2, \ldots$, constitute a realization of some stochastic process with almost surely finite sup-norm, for example, a Gaussian process. Assumption (B1), along with (8.3) and (8.4), leads to the following result:
$$\sup_{t \geq 1} \left|\boldsymbol\beta' \boldsymbol z_t\right| \leq \kappa \sum_{i=1}^\infty |\beta_i| < \infty, \quad \text{almost surely}, \tag{8.9}$$
for some $\kappa > 0$. To see this, first let $\boldsymbol\beta$ correspond to the true quantities $\boldsymbol\beta_0$; then $\sum_{i=1}^\infty |\beta_{0i}| < \infty$ by assumption. For random $\boldsymbol\beta$, observe that $\sum_{i=1}^\infty |\beta_i| < \infty$ almost surely, since $b$ is almost surely bounded by (8.3) and only finitely many $\eta_i$ are nonzero by (8.4). Condition (B1) is required for some limit calculations and boundedness of some norms associated with concentration inequalities.
Condition (B2) says that the covariate vectors at different time points, after scaling by $m_n$, are asymptotically orthogonal. This condition also implies the following:
$$\frac{\boldsymbol\beta' \boldsymbol z_t \boldsymbol z_s' \boldsymbol\beta}{m_n} \rightarrow 0 \quad \text{and} \quad \frac{\boldsymbol\beta_0' \boldsymbol z_t \boldsymbol z_s' \boldsymbol\beta_0}{m_n} \rightarrow 0, \quad \text{for } t \neq s. \tag{8.10}$$
To see (8.10), observe that
$$\left|\frac{\boldsymbol\beta' \boldsymbol z_t \boldsymbol z_s' \boldsymbol\beta}{m_n}\right| \leq \|\boldsymbol\beta\|^2 \left\|\frac{\boldsymbol z_t \boldsymbol z_s'}{m_n}\right\|. \tag{8.11}$$
In (8.11), $\|\boldsymbol\beta\|$ denotes the Euclidean norm of $\boldsymbol\beta$ and, for any matrix $\boldsymbol A$, $\|\boldsymbol A\|$ denotes the operator norm of $\boldsymbol A$ given by $\|\boldsymbol A\| = \sup_{\|\boldsymbol x\| = 1} \|\boldsymbol A \boldsymbol x\|$. By (B2), $\left\|\frac{\boldsymbol z_t \boldsymbol z_s'}{m_n}\right\| \rightarrow 0$ as $m_n \rightarrow \infty$. Also,
$$\|\boldsymbol\beta\| \leq \sum_{i=1}^\infty |\beta_i| < \infty, \quad \text{almost surely}, \tag{8.12}$$
by (8.3) and (8.4). It follows from (8.12) that (8.11) is almost surely finite. This and (B2) together imply the first part of the limit (8.10). Since $\sum_{i=1}^\infty |\beta_{0i}| < \infty$, the second limit of (8.10) follows in the same way.
As shown in Section 8.3, the relevant quadratic forms remain almost surely finite as $n \rightarrow \infty$, even if the covariates are unbounded, that is, even if (B1) does not hold. Since we assume only as much as that $\lambda_n$ is bounded above, (B3) is a reasonably mild assumption.
In (B4), (8.6) can be made to hold in practice by centering the covariates, that is, by replacing $z_{it}$ with $z_{it} - \bar z_i$, where $\bar z_i = \frac{1}{n}\sum_{t=1}^n z_{it}$. In (8.7) and (8.8) we assume that $c_0$ and $c_\beta$ remain finite for any choice of the sequence $\{s_i\}$. To see that finiteness holds, first note that
$$\frac{1}{n}\sum_{t=1}^n \left(\boldsymbol\beta' \boldsymbol z_t\right)^2 = \frac{\boldsymbol\beta' \boldsymbol Z_n' \boldsymbol Z_n \boldsymbol\beta}{n} \leq \|\boldsymbol\beta\|^2 \left\|\frac{\boldsymbol Z_n' \boldsymbol Z_n}{n}\right\|. \tag{8.13}$$
In (8.13), $\|\boldsymbol\beta\| < \infty$ almost surely, by (8.12), and $\left\|\frac{\boldsymbol Z_n' \boldsymbol Z_n}{n}\right\| = \lambda_n$ is bounded by (B3). Hence, (8.13) is almost surely finite. Similarly, $\frac{1}{n}\sum_{t=1}^n \left(\boldsymbol\beta_0' \boldsymbol z_t\right)^2 \leq \|\boldsymbol\beta_0\|^2 \lambda_n$, which is again almost surely finite due to (8.3), (8.4) and (B3). Thus, (8.3) and (8.4) are precisely the conditions that induce sparsity within our model in the sense of controlling the quadratic forms involving $\boldsymbol\beta$ and $\boldsymbol\beta_0$, given that (B4) holds. Assumptions on the existence of the limits are required for conditions (S2) and (S3) of Shalizi. As can be observed from Section 8.3, the quadratic forms converge almost surely as $n \rightarrow \infty$ for asymptotically orthogonal covariates, that is, even if (B1) does not hold. Hence, in this situation, the required limits of the quadratic forms exist and are zero, under very mild conditions.
Again, the limit existence assumption (B5) is required for verification of conditions (S2) and (S3) of Shalizi.
Assumption (B6), required to satisfy condition (S5) of Shalizi, is reasonably mild. The exponential thresholds for the relevant prior tail probabilities are attainable, for instance, by Gaussian process priors for $b$ or by independent sub-Gaussian components of $\boldsymbol\beta$. However, note that priors such as gamma or inverse gamma for $\sigma$ do not necessarily satisfy the condition. In such cases, one can modify the prior by replacing its tail, beyond an arbitrarily large positive value, with a thin-tailed density, such as a normal. In practice, such modified priors would be effectively the same as gamma or inverse gamma priors, and yet would satisfy the conditions of (B6).
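A minimal Python sketch of this tail-modification idea follows: the gamma density is kept on $(0, M]$ and a normal (thin) tail, scaled for continuity at $M$, is spliced in beyond the cutoff. The cutoff $M$, the gamma parameters and the normal tail width are illustrative choices, and normalization of the resulting density is omitted.

```python
import numpy as np
from scipy import stats

def modified_prior_density(x, shape=2.0, rate=1.0, M=20.0):
    """Unnormalized density: gamma body on (0, M], normal tail beyond M,
    with the tail rescaled so the two pieces match at x = M."""
    x = np.asarray(x, dtype=float)
    gam = stats.gamma(a=shape, scale=1.0 / rate)
    tail = stats.norm(loc=M, scale=1.0)
    scale = gam.pdf(M) / tail.pdf(M)          # continuity at the cutoff
    return np.where(x <= M, gam.pdf(x), scale * tail.pdf(x))
```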
Assumption (B7), in conjunction with the boundedness assumptions above, is a mild condition ensuring that the sets $\mathcal G_n$ are increasing in $n$ when $n \geq n_0$, for some $n_0 \geq 1$.
8.3 High-dimensional but not ultra high-dimensional setup
The setup we have discussed so far deals with the so-called ultra high-dimensional problem, in the sense that $\frac{m_n}{n} \rightarrow \infty$ as $n \rightarrow \infty$. This is a challenging problem to address, and we required a prior for $\boldsymbol\beta$ satisfying $\sum_{i=1}^\infty |\beta_i| < \infty$ almost surely. However, if we are only interested in the problem where $m_n \rightarrow \infty$ with $\frac{m_n}{n} \rightarrow 0$ as $n \rightarrow \infty$, then it is not necessary to insist on such priors to ensure finiteness of the relevant quadratic forms. For example, if the covariates are orthogonal, then
$$\frac{\boldsymbol Z_n' \boldsymbol Z_n}{n} \tag{8.14}$$
has maximum eigenvalue 1, so that (8.13) entails
$$\frac{1}{n}\sum_{t=1}^n \left(\boldsymbol\beta' \boldsymbol z_t\right)^2 \leq \|\boldsymbol\beta\|^2. \tag{8.15}$$
Now, if the components of $\boldsymbol\beta$ are independent and sub-Gaussian with mean zero, then by the Hanson-Wright inequality (see, for example, Rudelson and Vershynin (2013)) we have, for any $\epsilon > 0$,
$$P\left(\left|\|\boldsymbol\beta\|^2 - E\|\boldsymbol\beta\|^2\right| > \epsilon\right) \leq 2\exp\left[-c \min\left\{\frac{\epsilon^2}{K^4 m_n}, \frac{\epsilon}{K^2}\right\}\right], \tag{8.16}$$
where $c$ is some constant and $K$ is the upper bound of the sub-Gaussian norms of the components. If the component variances are summable, so that $E\|\boldsymbol\beta\|^2$ remains bounded, and if $\epsilon = \epsilon_n$ is chosen such that $\min\left\{\frac{\epsilon_n^2}{K^4 m_n}, \frac{\epsilon_n}{K^2}\right\} \geq \frac{2\log n}{c}$, then the right hand side of (8.16) is summable in $n$. Hence, by the Borel–Cantelli lemma, $\|\boldsymbol\beta\|^2$ remains almost surely appropriately controlled as $n \rightarrow \infty$. It then follows from (8.15) that the relevant quadratic forms are almost surely appropriately controlled as $n \rightarrow \infty$.
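The following Python sketch numerically illustrates the concentration phenomenon invoked above: for $\boldsymbol\beta$ with independent mean-zero sub-Gaussian entries and summable weights, the normalized quadratic form concentrates around its expectation, which tends to zero as the dimension grows. The diagonal weighting is an illustrative stand-in for the matrices appearing in (8.15)–(8.16).

```python
import numpy as np

def quad_form_concentration(ms=(100, 1000, 10000), n_draws=200, seed=3):
    """For each dimension m, draw beta with independent N(0,1) entries and
    compute beta' A beta / m with A = diag(1/(1+i)); the mean tr(A)/m -> 0
    and the spread shrinks, mirroring the Hanson-Wright-type concentration."""
    rng = np.random.default_rng(seed)
    for m in ms:
        a = 1.0 / (1.0 + np.arange(m))            # bounded, summable spectrum of A
        draws = [(rng.normal(size=m)**2 * a).sum() / m for _ in range(n_draws)]
        print(m, np.mean(draws), np.std(draws))   # mean ~ tr(A)/m -> 0
```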
For the non-ultra high-dimensional setup, the problem is largely simplified. Indeed, introduction of $b(\cdot)$ and $\boldsymbol\eta$ is not required, as we can directly consider sub-Gaussian priors for the components of $\boldsymbol\beta$ as detailed above. Consequently, the corresponding conditions simplify, and assumption (B6) is no longer required. Since the ultra high-dimensional setup is far more challenging than the non-ultra high-dimensional setup, we consider only the former setup for our purpose, and note that the latter setup can be dealt with using almost the same ideas but with much less effort.
Assumptions (B1)–(B6) lead to the following results that are the main ingredients in proving our posterior convergence in the ultra high-dimensional setup.
Lemma 23
Under (B1), (B2) and (B5), the KL-divergence rate $h(\theta)$ exists for each $\theta \in \Theta$, and is given by
$$h(\theta) = \frac{1}{2}\log\left(\frac{\sigma^2}{\sigma_0^2}\right) + \frac{\sigma_0^2 - \sigma^2}{2\sigma^2} + \lim_{n\rightarrow\infty} \frac{1}{2 n \sigma^2} \sum_{t=1}^n E\left\{\left(\rho_0 - \rho\right) x_{t-1} + \left(\boldsymbol\beta_0 - \boldsymbol\beta\right)' \boldsymbol z_t\right\}^2. \tag{8.17}$$
Theorem 24
Under (B1), (B2) and (B5), the asymptotic equipartition property holds and is given by
$$\lim_{n\rightarrow\infty} \frac{1}{n} \log R_n(\theta) = -h(\theta), \quad \text{almost surely}.$$
Furthermore, the convergence is uniform on any compact subset of $\Theta$.
Lemma 23 and Theorem 24 ensure that (S1)–(S3) hold, and (S4) holds since $h(\theta)$ is almost surely finite. (B6) implies that $\mathcal G_n$ increases to $\Theta$. In Section S-13.5 we verify (S5).
Now observe that the aim of assumption (S6) is to ensure that (see the proof of Lemma 7 of Shalizi (2009)), for every $\delta > 0$ and for all sufficiently large $n$,
$$\frac{1}{n}\log \int_{\Theta} R_n(\theta)\, \pi(\theta)\, d\theta \leq \delta - h(\Theta).$$
Since $\pi(\mathcal G_n) \rightarrow 1$ as $n \rightarrow \infty$, it is enough to verify that for every $\delta > 0$ and for all sufficiently large $n$,
$$\frac{1}{n}\log \int_{\mathcal G_n} R_n(\theta)\, \pi(\theta)\, d\theta \leq \delta - h(\Theta). \tag{8.18}$$
In this regard, first observe that
$$\frac{1}{n}\log \int_{\mathcal G_n} R_n(\theta)\, \pi(\theta)\, d\theta \leq \sup_{\theta \in \mathcal G_n} \frac{1}{n}\log R_n(\theta), \tag{8.19}$$
where the last inequality holds since $\pi(\mathcal G_n) \leq 1$. Now, letting $S = \{\theta : h(\theta) \leq h(\Theta) + c\}$, where $c$ is as large as desired,
$$\sup_{\theta \in \mathcal G_n} \frac{1}{n}\log R_n(\theta) \leq \max\left\{\sup_{\theta \in S} \frac{1}{n}\log R_n(\theta),\ \sup_{\theta \in \mathcal G_n \cap S^c} \frac{1}{n}\log R_n(\theta)\right\}. \tag{8.20}$$
From (8.17) it is clear that $h(\theta)$ is continuous in $\theta$ and that $h(\theta) \rightarrow \infty$ as $\|\theta\| \rightarrow \infty$. In other words, $h$ is a continuous, coercive function. Hence, $S$ is a compact set (see, for example, Lange (2010)). Hence it easily follows (see Chatterjee and Bhattacharya (2020)) that
$$\sup_{\theta \in S} \frac{1}{n}\log R_n(\theta) \rightarrow -h(\Theta), \quad \text{almost surely}. \tag{8.21}$$
We now show that
$$\sup_{\theta \in \mathcal G_n \cap S^c} \frac{1}{n}\log R_n(\theta) \leq \delta - h(\Theta), \quad \text{almost surely, for all sufficiently large } n. \tag{8.22}$$
First note that if (8.22) fails infinitely often, then $\frac{1}{n}\log R_n(\theta) > \delta - h(\Theta)$ for some $\theta \in \mathcal G_n \cap S^c$ infinitely often. Hence, if we can show that
$$\sum_{n=1}^\infty P\left(\sup_{\theta \in \mathcal G_n \cap S^c} \frac{1}{n}\log R_n(\theta) > \delta - h(\Theta)\right) < \infty, \tag{8.23}$$
then (8.22) will be proved. We use the Borel–Cantelli lemma to prove (8.23); in other words, we prove that the above sum of probabilities is finite.
The proof of Theorem 25 heavily uses (8.9), which is ensured by (B5), (8.3) and (8.4). Since $h$ is continuous, (S7) holds trivially.
We provide detailed verification of the seven assumptions of Shalizi in the supplement, which leads to the following result:
Theorem 26
Under assumptions (B1)–(B7), the non-marginal multiple testing procedure for testing (8.5) is consistent.
Needless to mention, all the results on error convergence of the non-marginal method also continue to hold for this setup under (B1)–(B7), thanks to the verification of Shalizi's conditions.
8.4 Remark on identifiability of our model and posterior consistency
Note that we have modeled $\boldsymbol\beta$ in terms of $b(\cdot)$ and $\boldsymbol\eta$. But from the likelihood it is evident that although $\boldsymbol\beta$ is identifiable, $b$ and $\boldsymbol\eta$ are not. But this is not an issue, since our interest is in the posterior of $\boldsymbol\beta$, not of $b$ or $\boldsymbol\eta$. Indeed, Theorem 3 of Shalizi guarantees that the posterior of the set $\{\theta : h(\theta) \leq h(\Theta) + \epsilon\}$ tends to 1 as $n \rightarrow \infty$, for any $\epsilon > 0$. We show in the supplement that $h(\Theta) = 0$ in our case. Since $h(\theta_0) = 0$, where $\theta_0$ is the true parameter, which includes $\boldsymbol\beta_0$ and lies in $\{\theta : h(\theta) \leq \epsilon\}$ for any $\epsilon > 0$, it follows that the posterior of $\boldsymbol\beta$ is consistent.
9 Summary and conclusion
In this article, we have investigated asymptotic properties of the Bayesian non-marginal procedure under the general dependence structure when the number of hypotheses also tends to infinity with the sample size. We specifically showed that our method is consistent even in this setup, that the different Bayesian versions of the error rates converge to zero exponentially fast, and that the expectations of the Bayesian versions with respect to the data also tend to zero. Since our results hold for any choice of the groups, it follows that they hold even for singleton groups, that is, for marginal decision rules. The results associated with $\alpha$-control also continue to hold in the same spirit as the finite-dimensional setup developed in Chandra and Bhattacharya (2020). Interestingly, provided that Shalizi's conditions hold, almost no assumption is required on the growth rate of the number of hypotheses to establish the results of the multiple testing procedures in high dimensions. Although in several cases, unlike the exact fixed-dimensional limits established in Chandra and Bhattacharya (2020), the exact high-dimensional limits associated with the error rates could not be established, exponential convergence to zero in high dimensions could still be achieved. Moreover, internal consistency of our results, as we make the transition from fixed dimension to high dimensions, is always ensured.
An important objective of this research is to show that the finite-dimensional time-varying variable selection problem in the autoregressive setup introduced in Chandra and Bhattacharya (2020) admits extension to the setup where the number of covariates to be selected by our Bayesian non-marginal procedure grows with the sample size. Indeed, we have shown that under reasonable assumptions, our asymptotic theories remain valid for this problem in both the high-dimensional and ultra high-dimensional situations. Different priors for the regression coefficients are of course warranted, and we have discussed the classes of relevant priors for the two different setups. As far as we are aware, at least in the time series context, such high-dimensional multiple hypothesis testing has not hitherto been dealt with. The priors that we introduce, particularly in the ultra high-dimensional context, also do not seem to have been considered before. These priors, in conjunction with the equipartition property, help control sparsity of the model quite precisely. As such, these ideas seem to be of independent interest for general high-dimensional asymptotics.
Supplementary Material
S-10 Proof of Theorem 4
Proof. From conditions (4.2) and (4.3), it follows that there exist $\delta_1, \delta_2 > 0$ and $n_1 \geq 1$ such that for all $n > n_1$,
$$\beta_n \geq \delta_1 \tag{S-10.1}$$
and
$$\beta_n \leq 1 - \delta_2. \tag{S-10.2}$$
It follows using this, (4.7) and (4.9), that for $n > \max\{n_0(\epsilon), n_1\}$, every $\boldsymbol d \neq \boldsymbol d^t_{m_n}$ satisfies $\xi(\boldsymbol d) < \xi(\boldsymbol d^t_{m_n})$; this is the content of (S-10.3)–(S-10.4). Now $\epsilon$ can be appropriately chosen such that the above bounds apply. Hence, for $n > \max\{n_0(\epsilon), n_1\}$, $\widehat{\boldsymbol d} = \boldsymbol d^t_{m_n}$, almost surely.
Hence, (4.11) holds, and by the dominated convergence theorem, (4.12) also follows.
S-11 Proof of Theorem 7
Proof.
Following Theorem 4, it holds, almost surely, that there exists $n_2$ such that $\widehat{\boldsymbol d} = \boldsymbol d^t_{m_n}$ for all $n > n_2$. Therefore, for $n > \max\{n_0(\epsilon), n_2\}$, $mFDR_{\boldsymbol X_n} \leq \pi\left(\tilde\Theta^c \mid \boldsymbol X_n\right) \leq \exp\left\{-n\left(J - \epsilon\right)\right\}$.
Thus, (5.1) is established. Using (4.10) and Corollary 5, (5.2) follows in the same way.
S-11.1 Proof of Theorem 9
Proof. Note that
From Theorem 7, $mFDR_{\boldsymbol X_n} \rightarrow 0$, as $n \rightarrow \infty$. Also we have $0 \leq mFDR_{\boldsymbol X_n} \leq 1$.
Therefore, by the dominated convergence theorem, $E\left(mFDR_{\boldsymbol X_n}\right) \rightarrow 0$, as $n \rightarrow \infty$. From (A2) we have that the relevant denominator is positive, and from Theorem 4 we have that the probability of its positivity tends to one. Thus $mpBFDR \rightarrow 0$, as $n \rightarrow \infty$. This proves the result.
It can be similarly shown that $pBFDR \rightarrow 0$, as $n \rightarrow \infty$.
S-12 Proof of Theorem 10
S-12.1 Proof of Theorem 12
S-12.2 Proof of Theorem 13
Proof. Theorem 3.4 of Chandra and Bhattacharya (2019) shows that $mpBFDR$ is non-increasing in $\beta$. Hence, for every $n$, the maximum error that can be incurred is at $\beta = 0$, where we actually maximize $\sum_{i=1}^{m_n} d_i w_i(\boldsymbol d)$. Let
Since the groups in $\{G_1, \ldots, G_{m_{1n}}\}$ have no overlap with those in $\{G_{m_{1n}+1}, \ldots, G_{m_n}\}$, the two corresponding parts of the criterion can be maximized separately.
Let us define the following notations:
Now,
since for any , by definition of .
Note that this quantity cannot be zero, as that would contradict the requirement of (B) that $G_1, \ldots, G_{m_{1n}}$ each have at least one false null hypothesis.
Now, from (4.7) and (4.9), we obtain for ,
(S-12.1) |
By our assumption (7.3), $\frac{\log m_n}{n} \rightarrow 0$ as $n \rightarrow \infty$, so that $m_n \exp\left\{-n\left(J - \epsilon\right)\right\} \rightarrow 0$ as $n \rightarrow \infty$. Also, the remaining term in (S-12.1) is bounded. Hence, (S-12.1) is negative for sufficiently large $n$. In other words, the stated configuration maximizes the criterion for sufficiently large $n$.
Let us now consider the term . Note that by (B). For any finite , is maximized for some decision configuration where for at least one . In that case,
so that for sufficiently large ,
(S-12.2) |
Now note that
(S-12.3) |
Since the rightmost side of (S-12.3) tends to zero as $n \rightarrow \infty$ due to (7.1), it follows that the left hand side also tends to zero as $n \rightarrow \infty$. Hence, dividing the numerators and denominators of the right hand side of (S-12.2) by $m_n$ and taking the limit as $n \rightarrow \infty$ shows that
(S-12.4) |
almost surely, for all data sequences. Boundedness of for all and ensures uniform integrability, which, in conjunction with the simple observation that for ,
for all , guarantees that under (B), .
Now, if $G_{m_{1n}+1}, \ldots, G_{m_n}$ are all disjoint, each consisting of only one true null hypothesis, then the second part of the criterion will be maximized by the configuration with $d_i = 1$ for all $i$. Since the first part is also maximized by this configuration for large $n$, it follows that it is the overall maximizer for large $n$. In this case,
(S-12.5) |
Now, for large enough ,
(S-12.6) |
Since due to (7.2), , as , it follows from (S-12.6) that
(S-12.7) |
Also, since for large enough ,
it follows using (7.1) that
(S-12.8) |
Hence, dividing the numerator and denominator in the ratio on the right hand side of (S-12.5) by and using the limits (S-12.7), (S-12.8) and (7.1) as , yields
(S-12.9) |
Hence, in this case, the maximum (that can be incurred at ) for is given by
Note that this is also the maximum asymptotic $mpBFDR$ that can be incurred among all possible configurations of the groups. Hence, for any arbitrary configuration of groups, the maximum asymptotic $mpBFDR$ that can be incurred lies in the interval $(0, b]$.
S-12.3 Proof of Theorem 17
S-12.4 Proof of Theorem 19
Proof. From Chandra and Bhattacharya (2019) it is known that $mpBFDR$ and $pBFDR$ are continuous and non-increasing in $\beta$. If $\widehat{\boldsymbol d}$ denotes the optimal decision configuration with respect to the additive loss function, then $\widehat d_i = I\left(v_i > \beta\right)$ for all $i$. Thus, assuming without loss of generality that the first $m_{0n}$ null hypotheses are true,
(S-12.10) |
Now, , so that as . Also, , so that , as . Hence, taking limits on both sides of (S-12.10), we obtain
The remaining part of the proof follows in the same way as that of Theorem 17.
S-12.5 Proof of Theorem 20
S-12.6 Proof of Theorem 21
Proof. Note that by Theorem 17, there exists a sequence $\{\beta_n\}$ such that $mpBFDR(\beta_n) \rightarrow \alpha$. Let $\widehat{\boldsymbol d}(\beta_n)$ be the optimal decision configuration associated with the sequence $\{\beta_n\}$. The proofs of Theorems 13 and 17 show that, for sufficiently large $n$, the decisions corresponding to the false null hypotheses are correct. Hence, using (4.8) we obtain
(S-12.11) | ||||
(S-12.12) |
Now,
Since , as ,
(S-12.13) | |||
(S-12.14) |
As $\epsilon$ is an arbitrary positive quantity, we have from (S-12.12), (S-12.13) and (S-12.14) that
S-13 Verification of (S1)-(S7) in model with time-varying covariates and proofs of the relevant theorems
All the probabilities and expectations below are with respect to the true model .
S-13.1 Verification of (S1)
We obtain
(S-13.1) |
It is easily seen that $R_n(\theta)$ is continuous in $\theta$ and $\boldsymbol X_n$. Hence, $R_n(\theta)$ is measurable. In other words, (S1) holds.
S-13.2 Proof of Lemma 23
It is easy to see that under the true model ,
(S-13.2) | ||||
(S-13.3) |
where, for any two sequences $\{a_n\}$ and $\{b_n\}$, $a_n \sim b_n$ stands for $\frac{a_n}{b_n} \rightarrow 1$ as $n \rightarrow \infty$. Hence,
(S-13.4) |
Now let
(S-13.5) |
and for ,
(S-13.6) |
where, for any , is so large that
(S-13.7) |
It follows, using (8.9) and (S-13.7), that for ,
(S-13.8) |
Hence, for ,
(S-13.9) |
Now,
(S-13.10) |
Similarly, it is easily seen, using (B4), that
(S-13.11) |
Since (S-13.8) implies that for , , it follows that
(S-13.12) |
and since is arbitrary, it follows that
(S-13.13) |
Hence, it also follows from (S-13.2), (S-13.4), (B4) and (S-13.13), that
(S-13.14) |
and
(S-13.15) |
Now note that
(S-13.16) |
Using (8.10), (S-13.9) and arbitrariness of it is again easy to see that
(S-13.17) |
Also, since for by independence, and since for , it holds that
(S-13.18) |
Combining (S-13.16), (S-13.15), (S-13.17) and (S-13.18) we obtain
(S-13.19) |
Using (B4) (8.9) and arbitrariness of , it follows that
In other words, (S2) holds, with given by (8.17).
S-13.3 Proof of Theorem 24
Note that
(S-13.20) |
where is an asymptotically stationary Gaussian process with mean zero and covariance
(S-13.21) |
Then
(S-13.22) |
By (S-13.13), the first term of the right hand side of (S-13.22) converges to , as , and since ; is also an irreducible and aperiodic Markov chain, by the ergodic theorem it follows that the second term of (S-13.22) converges to almost surely, as . For the third term, we observe that
(S-13.23) |
for , where , depending upon , is sufficiently large. Recalling from (B5) that , we then see that for ,
(S-13.24) |
for . From (S-13.24) it follows that
(S-13.25) |
Since by (B5) the limit of exists as , it follows that is still an irreducible and aperiodic Markov chain with asymptotically stationary zero-mean Gaussian process. Hence, by the ergodic theorem, the third term of (S-13.22) converges to zero, almost surely, as . It follows that
(S-13.26) |
and similarly,
(S-13.27) |
Now, since , it follows using (B2) (orthogonality) and (S-13.9) that for or ,
(S-13.28) |
By (B4), the first term on the right hand side of (S-13.28) is , where is or accordingly as is or . For the second term, due to (S-13.23), , where is either or . By (B5) the limit of exists as , and hence remains an irreducible, aperiodic Markov chain with zero mean Gaussian stationary distribution. Hence, by the ergodic theorem, it follows that the second term of (S-13.28) is zero, almost surely. In other words, almost surely,
(S-13.29) |
and similar arguments show that, almost surely,
(S-13.30) |
We now calculate the limit of , as . By (S-13.16),
(S-13.31) |
By (S-13.27), the first term on the right hand side of (S-13.31) is given, almost surely, by , and the second term is almost surely zero due to (S-13.30). For the third term, note that , and hence using (S-13.23), . Both ; and ; , are sample paths of irreducible and aperiodic Markov chains having stationary distributions with mean zero. Hence, by the ergodic theorem, the third term of (S-13.31) is zero, almost surely. That is,
(S-13.32) |
S-13.4 Verification of (S4)
In the expression for $h(\theta)$ given by (8.17), note that the constituent limits are almost surely finite. Hence, for any prior on $\rho$ and $\sigma$ such that they are almost surely finite, (S4) clearly holds. In particular, this holds for any proper priors on $\rho$ and $\sigma$.
S-13.5 Verification of (S5)
S-13.5.1 Verification of (S5) (1)
Since , it is easy to see that . Let , , , . We now define
where .
Since for all , it follows that is increasing in for , for some . To see this, note that if , then if , which holds by assumption (B7). Since as , there exists such that contains . Hence, for all . In other words, , as . Now observe that
where the last step is due to (B6).
S-13.5.2 Verification of (S5) (2)
First, we note that is compact, which can be proved using Arzela-Ascoli lemma, in almost the same way as in Chatterjee and Bhattacharya (2020). Since is compact for all , uniform convergence as required will be proven if we can show that is stochastically equicontinuous almost surely in for any and , almost surely, for all (see Newey (1991) for the general theory of uniform convergence in compact sets under stochastic equicontinuity). Since we have already verified pointwise convergence of the above for all while verifying (S3), it remains to prove stochastic equicontinuity of . Stochastic equicontinuity usually follows easily if one can prove that the function concerned is almost surely Lipschitz continuous. In our case, we can first verify Lipschitz continuity of by showing that its first partial derivatives with respect to the components of are almost surely bounded. With respect to and , the boundedness of the parameters in , (8.9) and the limit results (S-13.26), (S-13.27), (S-13.29), (S-13.30) and (S-13.32) readily show boundedness of the partial derivatives. With respect to , note that the derivative of , a relevant expression of (see (S-13.1)), is , whose Euclidean norm is bounded above by . In our case, by (B3). Moreover, is bounded in and , which is also bounded in . Boundedness of the partial derivatives with respect to of the other terms of involving are easy to observe. In other words, is stochastically equicontinuous.
To see that is equicontinuous, first note that in the expression (8.17), except the terms involving and , the other terms are easily seen to be Lipschitz, using boundedness of the partial derivatives. Let us now focus on the term . For our purpose, let us consider two different sequences and associated with and , respectively, such that and . As we have already shown that is Lipschitz in , we must have , for some Lipschitz constant . Taking the limit of both sides as shows that , proving that is Lipschitz in , when is held fixed. The bounded partial derivative with respect to also shows that is Lipschitz in both and . Similarly, the term in (8.17) is also Lipschitz continuous.
In other words, is stochastically equicontinuous almost surely in . Hence, the required uniform convergence is satisfied.
S-13.5.3 Verification of (S5) (3)
Continuity of $h$, compactness of $\mathcal G_n$, along with its non-decreasing nature with respect to $n$, imply that $h\left(\mathcal G_n\right) \rightarrow h(\Theta)$, as $n \rightarrow \infty$. Hence, (S5) holds.
S-13.6 Verification of (S6) and proof of Theorem 25
Note that in our case,
(S-13.33) |
Let , and ; let be the Cholesky decomposition. Also let , the -dimensional normal distribution with mean , the -dimensional vector with all components zero and variance , the -dimensional identity matrix. Then
(S-13.34) | ||||
(S-13.35) |
To deal with the first term of (S-13.34) first note that is Lipschitz in , with the square of the Lipschitz constant being , which is again bounded above by , for some constant , due to (8.9). It then follows using the Gaussian concentration inequality (see, for example, Giraud (2015)) that
(S-13.36) |
Now, for large enough , noting that up to some positive constant, we have
(S-13.37) | |||
(S-13.38) |
for some positive constants and .
Now, the prior is such that large values of receive small probabilities. Hence, if this prior is replaced by an appropriate function which has a thicker tail than the prior, then the resultant integral provides an upper bound for the first term of (S-13.38). We consider a function which is of mixture form depending upon , that is, we let , where , is the number of mixture components, , for , , for and , where , and , for all and . In this case,
(S-13.39) |
Now the -th integrand of (S-13.39) is minimized at , so that for sufficiently large , , for some positive constants and . Now, for sufficiently large , we have , for . Hence, for sufficiently large , for some positive constant . From these and (S-13.38) it follows that
(S-13.40) |
for some constant . Combining (S-13.38), (S-13.39) and (S-13.40) we obtain
(S-13.41) |
For the second term of (S-13.34), since is non-random, we can also view this as a set of independent realizations from any suitable independent zero mean process with variance on a compact set (due to (8.9)). In that case, by Hoeffding’s inequality (Hoeffding, 1963) we obtain
(S-13.42)
for some positive constants and . The last step follows in the same way as (S-13.41).
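The version of Hoeffding's inequality (Hoeffding, 1963) used in this step reads, in generic notation: if $X_1, \ldots, X_n$ are independent with $a_i \leq X_i \leq b_i$ almost surely, then for all $t > 0$,
\[
P\left(\left|\sum_{i=1}^n (X_i - E\,X_i)\right| \geq t\right) \leq 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).
\]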
We now deal with the first term of (S-13.35). First note that , for some , where is the Frobenius norm of . Also, any eigenvalue of any matrix satisfies , by Gerschgorin’s circle theorem (see, for example, Lange (2010)). In our case, the rows of are summable and the diagonal elements are bounded for any ; the maximum row sum is attained by the middle row when is odd and by the two middle rows when is even. In other words, the maximum eigenvalue of remains bounded for all . That is, , for some positive constant . Now observe that for integrals of the form , where , we can obtain, using the same technique as that pertaining to (S-13.41), that
(S-13.43)
for relevant positive constants , and . Then by the Hanson-Wright inequality, (S-13.43) and the same method for obtaining (S-13.41), we obtain the following bound for the first term of (S-13.35):
(S-13.44)
for relevant positive constants , , and .
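Two generic facts invoked in this step may be recorded for the reader's convenience; all symbols below are placeholders, not the notation of this section. First, the eigenvalue bound used above is an instance of Gerschgorin's circle theorem: every eigenvalue $\lambda$ of a matrix $A = (a_{ij})_{n \times n}$ satisfies, for some index $i$,
\[
|\lambda - a_{ii}| \leq \sum_{j \neq i} |a_{ij}|, \qquad \mbox{so that} \qquad |\lambda| \leq \max_{1 \leq i \leq n} \sum_{j=1}^n |a_{ij}|,
\]
whence bounded diagonal elements and summable rows keep all eigenvalues bounded. Second, the Hanson-Wright inequality (Rudelson and Vershynin, 2013) states that if $X = (X_1, \ldots, X_n)$ has independent, mean-zero components with sub-Gaussian norms $\|X_i\|_{\psi_2} \leq K$, then for any $n \times n$ matrix $A$ and all $t > 0$,
\[
P\left(\left|X^T A X - E\left(X^T A X\right)\right| > t\right) \leq 2\exp\left[-c\min\left(\frac{t^2}{K^4\|A\|_F^2}, \frac{t}{K^2\|A\|}\right)\right],
\]
where $\|A\|_F$ and $\|A\|$ denote the Frobenius and operator norms, respectively, and $c > 0$ is an absolute constant.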
Using the same Hoeffding-based technique as for the second term of (S-13.34), it is easy to see that the second term of (S-13.35) satisfies the following:
(S-13.45)
for relevant positive constants , , and .
Hence, combining (S-13.34), (S-13.35), (S-13.42), (S-13.44) and (S-13.45), we obtain
(S-13.46)
for relevant positive constants.
Let us now obtain a bound for . In the same way as above, we obtain, by first taking the expectation with respect to , the following:
(S-13.47)
for relevant positive constants. Since , it is evident that much of the mass of is concentrated around zero, where the function is small. To give greater weight to the function, we can replace with a mixture function of the form , for positive constants and . Here
As before, and . Hence, up to some positive constant,
(S-13.48)
The term within the parentheses in the exponent of (S-13.48) is minimized at . Note that , for large enough . Hence, for large , this term exceeds , for . Thus, (S-13.48) is bounded above by a constant times . Combining this with (S-13.47) we see that
(S-13.49)
where is the appropriate modification of in view of the transformation . Replacing with a mixture function of the form , for positive constants and , with , and , and applying the same techniques as before, we see from (S-13.49) that
(S-13.50)
for relevant positive constants.
Let us now deal with . As before, we assume that ; is a realization from some independent zero-mean process with variance . Note that . By (B1), . Let . Then, using Hoeffding’s inequality in conjunction with (8.9), we obtain
(S-13.51)
Then, first integrating with respect to , then integrating with respect to and finally with respect to , in each case using the gamma mixture form , for positive constants and , with , and , we find that
(S-13.52)
for relevant positive constants. It is also easy to see, using Hoeffding’s inequality in conjunction with (8.9), that
(S-13.53)
for relevant constants.
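A generic computation underlying such gamma-mixture integrations is the following, stated with placeholder symbols: integrating an exponential bound against a $\mbox{Gamma}(a, b)$ density gives, for any $s > -b$,
\[
\int_0^{\infty} e^{-sx}\,\frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-bx}\,dx = \left(\frac{b}{s+b}\right)^a,
\]
so that each successive integration converts an exponential-in-$x$ bound into an explicit constant.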
We next consider . Note that
(S-13.54)
(S-13.55)
Note that the expectation of (S-13.54) admits the same upper bound as (S-13.50). To deal with (S-13.55), we let and . Then , where and are appropriate modifications of and associated with (S-13.36). Observe that , where . Using (8.9) we obtain the same form of bound for (S-13.55) as in (S-13.36). That is, we have
(S-13.56)
where is some positive constant. Using the same method as before again we obtain a bound for the expectation of the first part of (S-13.56) of similar form as , for relevant positive constants. As before, here . For the second part of (S-13.56) we apply the method involving Hoeffding’s inequality as before, and obtain a bound of the above-mentioned form. Hence combining the bounds for the expectations of (S-13.51) and (S-13.55) we see that
(S-13.57)
for relevant positive constants.
Now let us bound the probability . Observe that
(S-13.58)
Using the Gaussian concentration inequality as before, it is easily seen that
(S-13.59)
for relevant positive constants.
The Gaussian concentration inequality also ensures that the second term of (S-13.58) is bounded above by , for some . Combining this with (S-13.58) and (S-13.59) we obtain
(S-13.60)
for relevant positive constants. Note that here .
For , we note that
(S-13.61)
For the first term of (S-13.61) we apply the Gaussian concentration inequality and then take expectations with respect to , , and . This yields the bound
for relevant positive constants. The bound for the second term is given by . Combining the two, we obtain
(S-13.62)
We now deal with the last term . Recall that , where and . Let . Then . Applying the Gaussian concentration inequality and the Hanson-Wright inequality, we find that
(S-13.63)
for some positive constants and . Taking the expectation of (S-13.63) with respect to , we obtain, as before,
(S-13.64)
for relevant positive constants. Recall that .
S-13.7 Verification of (S7)
Since as , it follows that for any set with , , as . In our case, , and hence , are decreasing in , so that must be non-increasing in . Moreover, for any , , so that , for all . Hence, continuity of implies that , as , and (S7) is satisfied.
Thus (S1)–(S7) are satisfied, so that Shalizi’s result stated in the main manuscript holds. It follows that all the asymptotic results of our main manuscript apply to this multiple testing problem.
References
- Bogdan, M., Chakrabarti, A., Frommlet, F., and Ghosh, J. K. (2011). Asymptotic Bayes-optimality under sparsity of some multiple testing procedures. Ann. Statist., 39(3), 1551–1579.
- Chandra, N. K. and Bhattacharya, S. (2019). Non-marginal Decisions: A Novel Bayesian Multiple Testing Procedure. Electronic Journal of Statistics, 13(1), 489–535.
- Chandra, N. K. and Bhattacharya, S. (2020). Asymptotic Theory of Dependent Bayesian Multiple Testing Procedures Under Possible Model Misspecification. arXiv preprint arXiv:1611.01369.
- Chatterjee, D. and Bhattacharya, S. (2020). Posterior Convergence of Gaussian Process Regression Under Possible Misspecifications. arXiv preprint.
- Cramér, H. and Leadbetter, M. R. (1967). Stationary and Related Stochastic Processes. Wiley, New York.
- Datta, J. and Ghosh, J. K. (2013). Asymptotic Properties of Bayes Risk for the Horseshoe Prior. Bayesian Anal., 8(1), 111–132.
- Fan, J. and Han, X. (2017). Estimation of the false discovery proportion with unknown dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4), 1143–1164.
- Fan, J., Han, X., and Gu, W. (2012). Estimating false discovery proportion under arbitrary covariance dependence. Journal of the American Statistical Association, 107(499), 1019–1035.
- Giraud, C. (2015). Introduction to High-Dimensional Statistics. CRC Press, Boca Raton.
- Hoeffding, W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58, 13–30.
- Lange, K. (2010). Numerical Analysis for Statisticians. Springer, New York.
- Müller, P., Parmigiani, G., Robert, C., and Rousseau, J. (2004). Optimal sample size for multiple testing: the case of gene expression microarrays. Journal of the American Statistical Association, 99(468), 990–1001.
- Newey, W. K. (1991). Uniform Convergence in Probability and Stochastic Equicontinuity. Econometrica, 59, 1161–1167.
- Rudelson, M. and Vershynin, R. (2013). Hanson-Wright Inequality and Sub-Gaussian Concentration. Electronic Communications in Probability, 18, 9.
- Sarkar, S. K., Zhou, T., and Ghosh, D. (2008). A general decision theoretic formulation of procedures controlling FDR and FNR from a Bayesian perspective. Statistica Sinica, 18(3), 925–945.
- Shalizi, C. R. (2009). Dynamics of Bayesian Updating with Dependent Data and Misspecified Models. Electron. J. Statist., 3, 1039–1074.
- Storey, J. D. (2003). The positive false discovery rate: a Bayesian interpretation and the q-value. Ann. Statist., 31(6), 2013–2035.
- Sun, W. and Cai, T. T. (2009). Large-scale multiple testing under dependence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2), 393–424.
- Xie, J., Cai, T. T., Maris, J., and Li, H. (2011). Optimal false discovery rate control for dependent data. Statistics and Its Interface, 4(4), 417.